OPTIMAL RATES OF CONVERGENCE FOR SPARSE 
COVARIANCE MATRIX ESTIMATION 

By T. Tony Cai and Harrison H. Zhou 

University of Pennsylvania and Yale University 

This paper considers estimation of sparse covariance matrices and 
establishes the optimal rate of convergence under a range of matrix 
operator norm and Bregman divergence losses. A major focus is on 
the derivation of a rate sharp minimax lower bound. The problem 
exhibits new features that are significantly different from those that 
occur in the conventional nonparametric function estimation prob- 
lems. Standard techniques fail to yield good results, and new tools 
are thus needed. 

We first develop a lower bound technique that is particularly well 
suited for treating "two-directional" problems such as estimating 
sparse covariance matrices. The result can be viewed as a general- 
ization of Le Cam's method in one direction and Assouad's Lemma 
in another. This lower bound technique is of independent interest and 
can be used for other matrix estimation problems. 

We then establish a rate sharp minimax lower bound for estimat- 
ing sparse covariance matrices under the spectral norm by applying 
the general lower bound technique. A thresholding estimator is shown 
to attain the optimal rate of convergence under the spectral norm. 
The results are then extended to the general matrix $\ell_w$ operator 
norms for $1 \le w \le \infty$. In addition, we give a unified result on the 
minimax rate of convergence for sparse covariance matrix estimation 
under a class of Bregman divergence losses. 

1. Introduction. Minimax risk is one of the most widely used bench- 
marks for optimality, and substantial efforts have been made on developing 
minimax theories in the statistics literature. A key step in establishing a 
minimax theory is the derivation of minimax lower bounds, and several effective 
lower bound arguments based on hypothesis testing have been introduced in 
the literature. Well-known techniques include Le Cam's method, Assouad's 
lemma and Fano's lemma. See Le Cam (1986) and Tsybakov (2009) for more 
detailed discussions on minimax lower bound arguments. 

Driven by a wide range of applications in high dimensional data analysis, 
estimation of large covariance matrices has drawn considerable recent atten- 
tion. See, for example, Bickel and Levina (2008a, 2008b), El Karoui (2008), 
Ravikumar et al. (2008), Lam and Fan (2009), Cai, Zhang and Zhou (2010) 
and Cai and Liu (2011). Many theoretical results, including consistency and 
rates of convergence, have been obtained. However, the optimality question 
remains mostly open in the context of covariance matrix estimation under 
the spectral norm, mainly due to the technical difficulty in obtaining good 
minimax lower bounds. 

In this paper we consider optimal estimation of sparse covariance matrices 
and establish the minimax rate of convergence under a range of matrix op- 
erator norm and Bregman divergence losses. A major focus is on the deriva- 
tion of a rate sharp lower bound under the spectral norm loss. Conventional 
lower bound techniques such as the ones mentioned earlier are designed and 
well suited for problems with parameters that are scalar or vector-valued. 
They have achieved great successes in solving many nonparametric function 
estimation problems which can be treated exactly or approximately as esti- 
mation of a finite or infinite dimensional vector and can thus be viewed as 
"one-directional" in terms of the lower bound arguments. In contrast, the 
problem of estimating a sparse covariance matrix under the spectral norm 
can be regarded as a truly "two-directional" problem where one direction 
is along the rows and another along the columns. It cannot be essentially 
reduced to a problem of estimating a single or multiple vectors. As a conse- 
quence, standard lower bound techniques fail to yield good results for this 
matrix estimation problem. New and more general technical tools are thus 
needed. 

In the present paper we first develop a minimax lower bound technique 
that is particularly well suited for treating "two-directional" problems such 
as estimating sparse covariance matrices. The result can be viewed as a 
simultaneous generalization of Le Cam's method in one direction and As- 
souad's lemma in another. This general technical tool is of independent 
interest and is useful for solving other matrix estimation problems such as 
optimal estimation of sparse precision matrices. 

We then consider specifically the problem of optimal estimation of sparse 
covariance matrices under the spectral norm. Let $X_1, \ldots, X_n$ be a random 
sample from a p-variate distribution with covariance matrix 
$\Sigma = (\sigma_{ij})_{1 \le i,j \le p}$. We wish to estimate the unknown matrix 
$\Sigma$ based on the sample $\{X_1, \ldots, X_n\}$. In this paper we shall use the 
weak $\ell_q$ ball with $0 \le q < 1$ to model the sparsity of the covariance 
matrix $\Sigma$. The weak $\ell_q$ ball was originally used in Abramovich et al. 
(2006) for a sparse normal means problem. A weak $\ell_q$ ball of radius $c$ in 
$\mathbb{R}^m$ contains elements with fast decaying ordered magnitudes of 
components, 

     $B_q^m(c) = \{\xi \in \mathbb{R}^m : |\xi|_{(k)}^q \le c k^{-1}, \text{ for all } k = 1, \ldots, m\}$, 

where $|\xi|_{(k)}$ denotes the kth largest element in magnitude of the vector $\xi$. 

For a covariance matrix $\Sigma = (\sigma_{ij})_{1 \le i,j \le p}$, denote by 
$\sigma_{-j,j}$ the jth column of $\Sigma$ with $\sigma_{jj}$ removed. We shall assume 
that $\sigma_{-j,j}$ is in a weak $\ell_q$ ball for all $1 \le j \le p$. More 
specifically, for $0 \le q < 1$, we define the parameter space 
$\mathcal{G}_q(c_{n,p})$ of covariance matrices by 

(1)  $\mathcal{G}_q(c_{n,p}) = \{\Sigma = (\sigma_{ij})_{1 \le i,j \le p} : \sigma_{-j,j} \in B_q^{p-1}(c_{n,p}), 1 \le j \le p\}$. 

In the special case of $q = 0$, a matrix in $\mathcal{G}_0(c_{n,p})$ has at most 
$c_{n,p}$ nonzero off-diagonal elements on each column. 
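
The membership condition in definition (1) can be checked mechanically. The following is a minimal Python sketch under stated assumptions: the function names `in_weak_lq_ball` and `in_parameter_space` are hypothetical, and for q = 0 the sketch reads $|x|^0$ as the indicator that $x$ is nonzero, which is the convention that makes the "at most $c_{n,p}$ nonzero off-diagonal elements" statement hold.

```python
def in_weak_lq_ball(xi, q, c):
    """Return True if |xi|_(k)^q <= c / k for every k = 1, ..., m.

    |xi|_(k) is the k-th largest entry of xi in absolute value.
    For q = 0 we read |x|^0 as 1 if x != 0 and 0 otherwise (an assumed
    convention), so the condition says xi has at most c nonzero entries.
    """
    mags = sorted((abs(x) for x in xi), reverse=True)
    for k, m in enumerate(mags, start=1):
        power = (1.0 if m != 0 else 0.0) if q == 0 else m ** q
        if power > c / k:
            return False
    return True


def in_parameter_space(sigma, q, c):
    """Check that every column of sigma, with its diagonal entry removed,
    lies in the weak l_q ball of radius c, as in definition (1)."""
    p = len(sigma)
    return all(
        in_weak_lq_ball([sigma[i][j] for i in range(p) if i != j], q, c)
        for j in range(p)
    )
```

For instance, `in_weak_lq_ball([3.0, 0.0, -2.0], 0, 2)` holds because the vector has two nonzero entries, while the same vector fails the check with radius 1.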

The problem of estimating sparse covariance matrices under the spectral 
norm has been considered, for example, in El Karoui (2008), Bickel and 
Levina (2008b), Rothman, Levina and Zhu (2009) and Cai and Liu (2011). 
Thresholding methods were introduced, and rates of convergence in proba- 
bility were obtained for the thresholding estimators. The parameter space 
$\mathcal{G}_q(c_{n,p})$ given in (1) also contains the uniformity class of covariance 
matrices considered in Bickel and Levina (2008b) as a special case. We assume that 
the distribution of the $X_i$'s is subgaussian in the sense that there is $\tau > 0$ 
such that 

(2)  $P\{|v^T(X_1 - EX_1)| > t\} \le e^{-t^2/(2\tau)}$ for all $t > 0$ and $\|v\|_2 = 1$. 

Let $\mathcal{P}_q(\tau, c_{n,p})$ denote the set of distributions of $X_1$ satisfying 
(2) and with covariance matrix $\Sigma \in \mathcal{G}_q(c_{n,p})$. 

Our technical analysis used in establishing a rate-sharp minimax lower 
bound has three major steps. The first step is to reduce the original prob- 
lem to a simpler estimation problem over a carefully chosen subset of the 
parameter space without essentially decreasing the level of difficulty. The 
second is to apply the general minimax lower bound technique to this sim- 
plified problem, and the final key step is to bound the total variation affinities 
between pairs of mixture distributions with specially designed sparse covari- 
ance matrices. The technical analysis requires ideas that are quite different 
from those used in the typical function/sequence estimation problems. 

The minimax upper bound is obtained by studying the risk properties of 
thresholding estimators. It will be shown that the optimal rate of conver- 
gence under mean squared spectral norm error is achieved by a thresholding 
estimator introduced in Bickel and Levina (2008b). We write $a_n \asymp b_n$ if there 
are positive constants $c$ and $C$ independent of $n$ such that $c \le a_n/b_n \le C$. 
For $1 \le w \le \infty$, the matrix $\ell_w$ operator norm of a matrix $A$ is defined by 
$|||A|||_w = \max_{\|x\|_w = 1} \|Ax\|_w$. The commonly used spectral norm $|||\cdot|||$ 
coincides with the matrix $\ell_2$ operator norm $|||\cdot|||_2$. (Throughout the paper, we 
shall write $|||\cdot|||$ without a subscript for the matrix spectral norm.) For a 
symmetric matrix $A$, it is known that the spectral norm $|||A|||$ is equal to the 
largest magnitude of the eigenvalues of $A$. Throughout the paper we shall 
assume that $1 < n^\beta \le p$ for some constant $\beta > 1$. Combining the results 
given in Sections 3 and 4, we have the following optimal rate of convergence 
for estimating sparse covariance matrices under the spectral norm. 

Theorem 1. Assume that 

(3)  $c_{n,p} \le M n^{(1-q)/2} (\log p)^{-(3-q)/2}$ 

for $0 \le q < 1$. The minimax risk of estimating the covariance matrix $\Sigma$ under 
the spectral norm over the class $\mathcal{P}_q(\tau, c_{n,p})$ satisfies 

(4)  $\inf_{\hat\Sigma} \sup_{\theta \in \mathcal{P}_q(\tau, c_{n,p})} \mathbb{E}_{X|\theta} |||\hat\Sigma - \Sigma|||^2 \asymp c_{n,p}^2 (\log p / n)^{1-q} + \log p / n$, 

where $\theta$ denotes a distribution in $\mathcal{P}_q(\tau, c_{n,p})$ with the covariance 
matrix $\Sigma$. Furthermore, (4) holds under the squared $\ell_w$ operator norm loss 
for all $1 \le w \le \infty$. 

We shall focus the discussions on the spectral norm loss. The extension 
to the general matrix operator norm is given in Section 6. In addition, 
we also consider optimal estimation under a class of Bregman matrix 
divergences which include Stein's loss, squared Frobenius norm and von Neumann 
entropy as special cases. Bregman matrix divergences provide a flexible class 
of dissimilarity measures between symmetric matrices and have been used 
for covariance and precision matrix estimation as well as matrix 
approximation problems. See, for example, Dhillon and Tropp (2007), Ravikumar et 
al. (2008) and Kulis, Sustik and Dhillon (2009). We give a unified result on 
the minimax rate of convergence in Section 5. 

Besides the sparsity assumption considered in this paper, another 
commonly used structural assumption in the literature is that the covariance 
matrix is "bandable" where the entries decay as they move away from the 
diagonal. This is particularly suitable in the setting where the variables 
exhibit a certain ordering structure, which is often the case for time series data. 
Various regularization methods have been proposed and studied under this 
assumption. Bickel and Levina (2008a) proposed a banding estimator and 
obtained rates of convergence for the estimator. Cai, Zhang and Zhou (2010) 
established the minimax rates of convergence and introduced a rate-optimal 
tapering estimator. In particular, Cai, Zhang and Zhou (2010) derived rate 
sharp minimax lower bounds for estimating bandable matrices. It should be 
noted that the lower bound techniques used there do not lead to a good 
result for estimating sparse covariance matrices under the spectral norm. 

The rest of the paper is organized as follows. Section 2 introduces a 
general technical tool for deriving lower bounds on the minimax risk. 
Section 3 establishes the minimax lower bound for estimating sparse covari- 
ance matrices under the spectral norm. The upper bound is obtained in Sec- 
tion 4 by studying the risk properties of thresholding estimators. Section 5 
considers optimal estimation under the Bregman divergences. A uniform op- 
timal rate of convergence is given for a class of Bregman divergence losses. 
Section 6 discusses extensions to estimation under the general matrix $\ell_w$ 
operator norm for $1 \le w \le \infty$ and connections to other related problems including optimal 
estimation of sparse precision matrices. The proofs are given in Section 7. 

2. General lower bound for minimax risk. In this section we develop a 
new general minimax lower bound technique that is particularly well suited 
for treating "two-directional" problems such as estimating sparse covariance 
matrices. The new method can be viewed as a generalization of both Le 
Cam's method and Assouad's lemma. To help motivate and understand the 
new lower bound argument, it is useful to briefly review Le Cam's method 
and Assouad's lemma. 

Le Cam's method is based on a two-point testing argument and is widely 
used in estimating linear functionals. See Le Cam (1973) and 
Donoho and Liu (1991). Let $X$ be an observation from a distribution $\mathbb{P}_\theta$ 
where $\theta$ belongs to a parameter set $\Theta$. For two distributions $P$ and $Q$ with 
densities $p$ and $q$ with respect to any common dominating measure $\mu$, the 
total variation affinity is given by $\|P \wedge Q\| = \int p \wedge q \, d\mu$. Le Cam's 
method works with a finite parameter set $\Theta = \{\theta_0, \theta_1, \ldots, \theta_D\}$. 
Let $L$ be a loss function. Define 
$l_{\min} = \min_{1 \le i \le D} \inf_t [L(t, \theta_0) + L(t, \theta_i)]$ and denote 
$\bar{\mathbb{P}} = \frac{1}{D} \sum_{i=1}^D \mathbb{P}_{\theta_i}$. Le Cam's method gives a 
lower bound for the maximum estimation risk over the parameter set $\Theta$. 

Lemma 1 (Le Cam). Let $T$ be any estimator of $\theta$ based on an observation 
$X$ from a distribution $\mathbb{P}_\theta$ with $\theta \in \Theta = \{\theta_0, \theta_1, \ldots, \theta_D\}$, then 

(5)  $\sup_{\theta \in \Theta} \mathbb{E}_{X|\theta} L(T, \theta) \ge \frac{1}{2} l_{\min} \|\mathbb{P}_{\theta_0} \wedge \bar{\mathbb{P}}\|$. 

Write $\Theta_1 = \{\theta_1, \ldots, \theta_D\}$. One can view the lower bound in (5) as 
obtained from testing the simple hypothesis $H_0: \theta = \theta_0$ against the composite 
alternative $H_1: \theta \in \Theta_1$. 
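
To make the total variation affinity and the mixture in Le Cam's bound concrete, here is a small Python sketch for distributions on a finite set. It is an illustration only: the helper names `affinity`, `mixture` and `le_cam_bound` are hypothetical, and representing distributions as lists of probabilities is an assumption of the sketch, not the setting of the paper.

```python
def affinity(p, q):
    """Total variation affinity ||P ^ Q|| = sum_x min(p(x), q(x))
    for two probability mass functions on the same finite set."""
    return sum(min(a, b) for a, b in zip(p, q))


def mixture(dists, weights=None):
    """Pointwise mixture of pmfs; uniform weights by default."""
    m = len(dists)
    weights = weights or [1.0 / m] * m
    return [sum(w * d[x] for w, d in zip(weights, dists))
            for x in range(len(dists[0]))]


def le_cam_bound(l_min, p0, alternatives):
    """Right-hand side of (5): (1/2) * l_min * ||P_theta0 ^ P-bar||."""
    return 0.5 * l_min * affinity(p0, mixture(alternatives))


# A toy case: each alternative is easy to tell apart from p0,
# but their uniform mixture coincides with p0.
p0 = [0.5, 0.5]
alternatives = [[1.0, 0.0], [0.0, 1.0]]
single = affinity(p0, alternatives[0])
mixed = affinity(p0, mixture(alternatives))
```

In this toy case the affinity between `p0` and a single alternative is 0.5, while the affinity between `p0` and the mixture is 1, which illustrates why testing against a composite alternative can give a larger lower bound than a two-point comparison.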

Assouad's lemma works with a hypercube $\Theta = \{0,1\}^r$. It is based on 
testing a number of pairs of simple hypotheses and is connected to multiple 
comparisons. For a parameter $\theta = (\theta_1, \ldots, \theta_r)$ where $\theta_i \in \{0,1\}$, one tests 
whether $\theta_i = 0$ or 1 for each $1 \le i \le r$ based on the observation $X$. For each 
pair of simple hypotheses, there is a certain loss for making an error in the 
comparison. The lower bound given by Assouad's lemma is a combination 
of losses from testing all pairs of simple hypotheses. Let 

(6)  $H(\theta, \theta') = \sum_{i=1}^r |\theta_i - \theta'_i|$ 

be the Hamming distance on $\Theta$. Assouad's lemma gives a lower bound for 
the maximum risk over the hypercube $\Theta$ of estimating an arbitrary quantity 
$\psi(\theta)$ belonging to a metric space with metric $d$. 

Lemma 2 (Assouad). Let $X \sim \mathbb{P}_\theta$ with $\theta \in \Theta = \{0,1\}^r$ and let $T = T(X)$ 
be an estimator of $\psi(\theta)$ based on $X$. Then for all $s > 0$, 

(7)  $\max_{\theta \in \Theta} 2^s \mathbb{E}_{X|\theta} d^s(T, \psi(\theta)) \ge \min_{H(\theta,\theta') \ge 1} \frac{d^s(\psi(\theta), \psi(\theta'))}{H(\theta, \theta')} \cdot \frac{r}{2} \cdot \min_{H(\theta,\theta') = 1} \|\mathbb{P}_\theta \wedge \mathbb{P}_{\theta'}\|$. 
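
The first factor in Assouad's bound, the minimum loss per unit of Hamming distance, can be computed by brute force on a small hypercube. The Python sketch below is only an illustration: the names `hamming`, `min_loss_ratio` and `euclid`, and the toy choice $\psi(\theta) = \delta\theta$ with the Euclidean metric, are assumptions made here and do not come from the paper.

```python
import math
from itertools import product


def hamming(theta, theta_prime):
    """Hamming distance (6) between two points of the hypercube {0,1}^r."""
    return sum(abs(a - b) for a, b in zip(theta, theta_prime))


def min_loss_ratio(r, psi, dist, s):
    """Brute-force the first factor in (7):
    min over pairs with H >= 1 of d^s(psi(theta), psi(theta')) / H."""
    cube = list(product((0, 1), repeat=r))
    return min(
        dist(psi(t), psi(u)) ** s / hamming(t, u)
        for t in cube for u in cube if hamming(t, u) >= 1
    )


def euclid(x, y):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Toy example: psi(theta) = delta * theta with squared Euclidean loss (s = 2).
delta = 0.5
ratio = min_loss_ratio(3, lambda t: [delta * x for x in t], euclid, 2)
```

In this toy example the squared distance equals $\delta^2 H(\theta, \theta')$ for every pair, so the ratio is $\delta^2$ up to rounding.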

We now introduce our new lower bound technique. Again, let $X \sim \mathbb{P}_\theta$ 
where $\theta \in \Theta$. The parameter space $\Theta$ of interest has a special structure 
which can be viewed as the Cartesian product of two components $\Gamma$ and $\Lambda$. 
For a given positive integer $r$ and a finite set $B \subset \mathbb{R}^p \setminus \{0_{1 \times p}\}$, 
let $\Gamma = \{0,1\}^r$ and $\Lambda \subseteq B^r$. Define 

(8)  $\Theta = \Gamma \otimes \Lambda = \{\theta = (\gamma, \lambda) : \gamma \in \Gamma \text{ and } \lambda \in \Lambda\}$. 

In comparison, the standard lower bound arguments work with either $\Gamma$ or 
$\Lambda$ alone. For example, Assouad's lemma considers only the parameter set $\Gamma$ 
and Le Cam's method typically applies to a parameter set like $\Lambda$ with 
$r = 1$. For $\theta = (\gamma, \lambda) \in \Theta$, denote the projection of $\theta$ to $\Gamma$ by 
$\gamma(\theta) = \gamma$ and to $\Lambda$ by $\lambda(\theta) = \lambda$. 

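It is important to understand the structure of the parameter space $\Theta$. 
One can view an element $\lambda \in \Lambda$ as an $r \times p$ matrix with each row coming 
from the set $B$ and view $\Gamma$ as a set of parameters along the rows indicating 
whether a given row of $\lambda$ is present or not. Let $D_\Lambda = \mathrm{Card}(\Lambda)$. For a given 
$a \in \{0,1\}$ and $1 \le i \le r$, denote $\Theta_{i,a} = \{\theta \in \Theta : \gamma_i(\theta) = a\}$ where 
$\theta = (\gamma, \lambda)$ and $\gamma_i(\theta)$ is the ith coordinate of the first component of $\theta$. 
It is easy to see that $\mathrm{Card}(\Theta_{i,a}) = 2^{r-1} D_\Lambda$. Define the mixture 
distribution $\bar{\mathbb{P}}_{i,a}$ by 

(9)  $\bar{\mathbb{P}}_{i,a} = \frac{1}{2^{r-1} D_\Lambda} \sum_{\theta \in \Theta_{i,a}} \mathbb{P}_\theta$. 

So $\bar{\mathbb{P}}_{i,a}$ is the mixture distribution over all $\mathbb{P}_\theta$ with $\gamma_i(\theta)$ fixed 
to be $a$ while all other components of $\theta$ vary over all possible values in $\Theta$. 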

The following lemma gives a lower bound for the maximum risk over the 
parameter set $\Theta$ of estimating a functional $\psi(\theta)$ belonging to a metric space 
with metric $d$. 

Lemma 3. For any $s > 0$ and any estimator $T$ of $\psi(\theta)$ based on an 
observation from the experiment $\{\mathbb{P}_\theta, \theta \in \Theta\}$ where $\Theta$ is given in (8), 

(10)  $\max_{\Theta} 2^s \mathbb{E}_{X|\theta} d^s(T, \psi(\theta)) \ge \alpha \frac{r}{2} \min_{1 \le i \le r} \|\bar{\mathbb{P}}_{i,0} \wedge \bar{\mathbb{P}}_{i,1}\|$, 

where $\bar{\mathbb{P}}_{i,a}$ is defined in equation (9) and $\alpha$ is given by 

(11)  $\alpha = \min_{\{(\theta,\theta') : H(\gamma(\theta), \gamma(\theta')) \ge 1\}} \frac{d^s(\psi(\theta), \psi(\theta'))}{H(\gamma(\theta), \gamma(\theta'))}$. 

The idea behind this new lower bound argument is similar to the one for 
Assouad's lemma, but applies in a more complicated setting. Based on an 
observation $X \sim \mathbb{P}_\theta$ where $\theta = (\gamma, \lambda) \in \Theta = \Gamma \otimes \Lambda$, we wish to test whether 
$\gamma_i = 0$ or 1 for each $1 \le i \le r$. The first factor $\alpha$ in the lower bound (10) is the 
minimum cost of making an error per comparison. The second factor $r/2$ is 
the expected number of errors one makes to estimate $\gamma$ when $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ are 
indistinguishable from each other in the case $H(\gamma(\theta), \gamma(\theta')) = r$, and the last 
factor is the lower bound for the total probability of making type I and type 
II errors for each comparison. A major difference is that in this third factor 
the distributions $\bar{\mathbb{P}}_{i,0}$ and $\bar{\mathbb{P}}_{i,1}$ are both complicated mixture distributions 
instead of the typically simple ones in Assouad's lemma. This makes the 
lower bound argument more generally applicable, while the calculation of 
the affinity becomes much more difficult. 

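In applications of Lemma 3, for a $\gamma = (\gamma_1, \ldots, \gamma_r) \in \Gamma$ where $\gamma_i$ takes 
value 0 or 1, and a $\lambda = (\lambda_1, \ldots, \lambda_r) \in \Lambda$ where each $\lambda_i \in B$ is a p-dimensional 
nonzero row vector, the element $\theta = (\gamma, \lambda) \in \Theta$ can be equivalently viewed 
as an $r \times p$ matrix 

(12)  $\begin{pmatrix} \gamma_1 \cdot \lambda_1 \\ \gamma_2 \cdot \lambda_2 \\ \vdots \\ \gamma_r \cdot \lambda_r \end{pmatrix}$, 

where the product $\gamma_i \cdot \lambda_i$ is taken elementwise: $\gamma_i \cdot \lambda_i = \lambda_i$ if $\gamma_i = 1$ and the 
ith row of $\theta$ is the zero vector if $\gamma_i = 0$. The term $\|\bar{\mathbb{P}}_{i,0} \wedge \bar{\mathbb{P}}_{i,1}\|$ of equation 
(10) is then the lower bound for the total probability of making type I and 
type II errors for testing whether or not the ith row of $\theta$ is zero. 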
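
The product structure of the parameter space can be enumerated directly for tiny cases. The following Python sketch builds the matrix view of a parameter and lists the slice on which the ith indicator is fixed; the names `theta_matrix` and `theta_slice`, the 0-based row index, and the toy set `B` are assumptions made for illustration.

```python
from itertools import product


def theta_matrix(gamma, lam):
    """View theta = (gamma, lam) as the r x p matrix in (12):
    row i equals lam[i] if gamma[i] == 1 and the zero vector otherwise."""
    return [[g * x for x in row] for g, row in zip(gamma, lam)]


def theta_slice(r, Lambda, i, a):
    """Enumerate Theta_{i,a} = {theta in Gamma x Lambda : gamma_i = a}
    (i is 0-based here)."""
    return [(g, lam) for g in product((0, 1), repeat=r)
            for lam in Lambda if g[i] == a]


# Toy example with p = 2 and r = 2: B holds two nonzero row vectors
# and Lambda is all of B^r, so D_Lambda = 4.
B = [(0, 1), (1, 0)]
Lambda = list(product(B, repeat=2))
slice_size = len(theta_slice(2, Lambda, 0, 1))
example = theta_matrix((1, 0), ((0, 1), (1, 0)))
```

Here `slice_size` agrees with the count $2^{r-1} D_\Lambda = 8$ stated in the text, and `example` has its second row zeroed out because the second indicator is 0.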

Note that the lower bound (10) reduces to the classical Assouad lemma 
when A contains only one matrix for which every row is nonzero, and be- 
comes a two-point argument of Le Cam with one point against a mixture 
when r = 1. The proof of this lemma is given in Section 7. The technical 
argument is an extension of that of Assouad's lemma. See Assouad (1983), 
Yu (1997) and van der Vaart (1998). 

The advantage of this method is the ability to break down the lower bound 
calculations for the whole matrix estimation problem into calculations for 
individual rows so that the overall analysis is simplified and more tractable. 
Although the tool is introduced here for the purpose of estimating a sparse 
covariance matrix, it is of independent interest and is expected to be useful 
for solving other matrix estimation problems as well. 

Bounding the total variation affinity between two mixture distributions 
in (10) is quite challenging in general. The following well-known result on 
the affinity is helpful in some applications. It provides lower bounds for the 
affinity between two mixture distributions in terms of the affinities between 
simpler distributions in the mixtures. 

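Lemma 4. Let $\bar{\mathbb{P}}_m = \sum_{i=1}^m w_i \mathbb{P}_i$ and $\bar{\mathbb{Q}}_m = \sum_{i=1}^m w_i \mathbb{Q}_i$ where $w_i \ge 0$ and 
$\sum_{i=1}^m w_i = 1$. Then 

     $\|\bar{\mathbb{P}}_m \wedge \bar{\mathbb{Q}}_m\| \ge \sum_{i=1}^m w_i \|\mathbb{P}_i \wedge \mathbb{Q}_i\| \ge \min_{1 \le i \le m} \|\mathbb{P}_i \wedge \mathbb{Q}_i\|$. 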
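
The two inequalities in Lemma 4 are easy to verify numerically on a finite sample space. The sketch below uses made-up probability vectors and weights chosen only for illustration; `affinity` and `mix` are hypothetical helper names.

```python
def affinity(p, q):
    """Total variation affinity sum_x min(p(x), q(x)) for two pmfs."""
    return sum(min(a, b) for a, b in zip(p, q))


def mix(dists, w):
    """Mixture sum_i w[i] * dists[i] of pmfs on a common finite set."""
    return [sum(wi * d[x] for wi, d in zip(w, dists))
            for x in range(len(dists[0]))]


# Illustrative weights and component distributions (not from the paper).
w = [0.3, 0.7]
P = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
Q = [[0.1, 0.3, 0.6], [0.3, 0.5, 0.2]]

lhs = affinity(mix(P, w), mix(Q, w))                          # mixture affinity
mid = sum(wi * affinity(p, q) for wi, p, q in zip(w, P, Q))   # weighted average
low = min(affinity(p, q) for p, q in zip(P, Q))               # worst component
```

For these numbers the three quantities are approximately 0.92, 0.78 and 0.5, in the order predicted by the lemma.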

More specifically, in our construction of the parameter set for establishing 
the minimax lower bound, $r$ is the number of possibly nonzero rows in the 
upper triangle of the covariance matrix, and $\Lambda$ is the set of matrices with 
$r$ rows to determine the upper triangle matrix. Recall that the projection 
of $\theta \in \Theta$ to $\Gamma$ is $\gamma(\theta) = \gamma = (\gamma_i(\theta))_{1 \le i \le r}$ and the projection of $\theta$ to $\Lambda$ is 
$\lambda(\theta) = \lambda = (\lambda_i(\theta))_{1 \le i \le r}$. More generally, for a subset $A \subseteq \{1, 2, \ldots, r\}$, we 
define a projection of $\theta$ to a subset of $\Gamma$ by $\gamma_A(\theta) = (\gamma_i(\theta))_{i \in A}$. A particularly 
useful example of set $A$ is 

     $\{-i\} = \{1, \ldots, i-1, i+1, \ldots, r\}$ 

for which $\gamma_{\{-i\}}(\theta) = (\gamma_1(\theta), \ldots, \gamma_{i-1}(\theta), \gamma_{i+1}(\theta), \ldots, \gamma_r(\theta))$ and in this case for 
convenience we set $\gamma_{-i} = \gamma_{\{-i\}}$. $\lambda_A(\theta)$ and $\lambda_{-i}(\theta)$ are defined similarly. We 
also define the set $\Lambda_A = \{\lambda_A(\theta) : \theta \in \Theta\}$. A special case is $A = \{-i\}$. 

Now we define a subset of $\Theta$ to reduce the problem of estimating $\theta$ to a 
problem of estimating $\lambda_i(\theta)$. For $a \in \{0,1\}$, $b \in \{0,1\}^{r-1}$ and $c \in \Lambda_{-i} \subseteq B^{r-1}$, 
let 

     $\Theta_{(i,a,b,c)} = \{\theta \in \Theta : \gamma_i(\theta) = a, \gamma_{-i}(\theta) = b \text{ and } \lambda_{-i}(\theta) = c\}$ 

and $D_{(i,b,c)} = \mathrm{Card}(\Theta_{(i,a,b,c)})$. Note that the cardinality of $\Theta_{(i,a,b,c)}$ on the 
right-hand side does not depend on the value of $a$ due to the Cartesian 
product structure of $\Theta = \Gamma \otimes \Lambda$. Define the mixture distribution 

(13)  $\bar{\mathbb{P}}_{(i,a,b,c)} = \frac{1}{D_{(i,b,c)}} \sum_{\theta \in \Theta_{(i,a,b,c)}} \mathbb{P}_\theta$. 

In other words, $\bar{\mathbb{P}}_{(i,a,b,c)}$ is the mixture distribution over all $\mathbb{P}_\theta$ with $\lambda_i(\theta)$ 
varying over all possible values while all other components of $\theta$ remain fixed. 
It is helpful to observe that when $a = 0$, we have $\gamma_i(\theta) \cdot \lambda_i(\theta) = 0$ for which 
$\bar{\mathbb{P}}_{(i,a,b,c)}$ is degenerate in the sense that it is an average of identical 
distributions. 

Lemmas 3 and 4 together immediately imply the following result which 
is based on the total variation affinities between slightly less complicated 
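mixture distributions. We need to introduce a new notation $\mathbb{E}_\theta$ to denote 
the average of a function $g$ over $\Theta$, that is, 

     $\mathbb{E}_\theta g(\theta) = \frac{1}{2^r D_\Lambda} \sum_{\theta \in \Theta} g(\theta)$. 

The parameter $\theta$ is seen as uniformly distributed over $\Theta$. Let 

     $\Theta_{-i} = \{0,1\}^{r-1} \otimes \Lambda_{-i} = \{(b, c) : \exists\, \theta \in \Theta \text{ such that } \gamma_{-i}(\theta) = b \text{ and } \lambda_{-i}(\theta) = c\}$, 

and an average of $h(\gamma_{-i}, \lambda_{-i})$ over the set $\Theta_{-i}$ is defined as follows: 

     $\mathbb{E}_{(\gamma_{-i}, \lambda_{-i})} h(\gamma_{-i}, \lambda_{-i}) = \sum_{(b,c) \in \Theta_{-i}} \frac{D_{(i,b,c)}}{2^{r-1} D_\Lambda} h(b, c)$, 

where the distribution of $(\gamma_{-i}, \lambda_{-i})$ is induced by the uniform distribution 
over $\Theta$. 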

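Corollary 1. For any $s > 0$ and any estimator $T$ of $\psi(\theta)$ based on an 
observation from the experiment $\{\mathbb{P}_\theta, \theta \in \Theta\}$ where the parameter space $\Theta$ is 
given in (8), 

      $\max_{\Theta} 2^s \mathbb{E}_{X|\theta} d^s(T, \psi(\theta))$ 

(14)  $\ge \alpha \frac{r}{2} \min_i \mathbb{E}_{(\gamma_{-i}, \lambda_{-i})} \|\bar{\mathbb{P}}_{(i,0,\gamma_{-i},\lambda_{-i})} \wedge \bar{\mathbb{P}}_{(i,1,\gamma_{-i},\lambda_{-i})}\|$ 

(15)  $\ge \alpha \frac{r}{2} \min_i \min_{\gamma_{-i}, \lambda_{-i}} \|\bar{\mathbb{P}}_{(i,0,\gamma_{-i},\lambda_{-i})} \wedge \bar{\mathbb{P}}_{(i,1,\gamma_{-i},\lambda_{-i})}\|$, 

where $\alpha$ and $\bar{\mathbb{P}}_{(i,a,b,c)}$ are defined in equations (11) and (13), respectively. 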

Remark 1. A key technical step in applying Lemma 3 in a typical 
application is to show that the affinity $\|\bar{\mathbb{P}}_{i,0} \wedge \bar{\mathbb{P}}_{i,1}\|$ is uniformly bounded 
away from 0 by a constant for all $i$. Then the term $\alpha r$ on the right-hand 
side of equation (10) in Lemma 3 gives the lower bound for the minimax 
rate of convergence. As mentioned earlier, the affinity calculations for two 
mixture distributions can be very involved. Corollary 1 gives two lower 
bounds in terms of the affinities. As noted earlier, $\bar{\mathbb{P}}_{(i,0,\gamma_{-i},\lambda_{-i})}$ in the affinity 
in equations (14) and (15) is in fact a single normal distribution, not a 
mixture. Thus the lower bounds given in equations (14) and (15) require 
simpler, although still involved, calculations. In this paper we will apply 
equation (14), which has an average of affinities on the right-hand side. 

3. Lower bound for estimating sparse covariance matrix under the 
spectral norm. We now turn to the minimax lower bound for estimating sparse 
covariance matrices under the spectral norm. We shall apply the lower bound 
technique developed in the previous section to establish rate sharp results. 
The same lower bound also holds under the general $\ell_w$ norm for $1 \le w \le \infty$. 
Upper bounds are discussed in Section 4 and optimal estimation under Breg- 
man divergence losses is considered in Section 5. 

In this section we shall focus on the Gaussian case and wish to estimate 
the covariance matrix $\Sigma_{p \times p}$ under the spectral norm based on the sample 
$X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} N(\mu, \Sigma_{p \times p})$. The parameter space $\mathcal{G}_q(c_{n,p})$ for sparse 
covariance matrices is defined as in (1). In the special case of $q = 0$, $\mathcal{G}_0(c_{n,p})$ 
contains matrices with at most $c_{n,p} + 1$ nonzero elements on each row/column. 
The parameter space $\mathcal{G}_q(c_{n,p})$ also contains the uniformity class $\mathcal{G}_q^*(c_{n,p})$ 
considered in Bickel and Levina (2008b) as a special case, where $\mathcal{G}_q^*(c_{n,p})$ is 
defined as, for $0 \le q < 1$, 

(16)  $\mathcal{G}_q^*(c_{n,p}) = \{\Sigma = (\sigma_{ij})_{1 \le i,j \le p} : \max_{1 \le j \le p} \sum_{i \ne j} |\sigma_{ij}|^q \le c_{n,p}\}$. 

The columns of $\Sigma \in \mathcal{G}_q^*(c_{n,p})$ are assumed to belong to a strong $\ell_q$ ball. 

We now state and prove the minimax lower bound for estimating a sparse 
covariance matrix over the parameter space $\mathcal{G}_q(c_{n,p})$ under the spectral norm. 
The derivation of the lower bounds relies heavily on the general lower bound 
technique developed in the previous section. It also requires a careful con- 
struction of a finite subset of the parameter space and detailed calculations 
of an effective lower bound for the total variation affinities between mixtures 
of multivariate Gaussian distributions. 

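Theorem 2. Let $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} N(\mu, \Sigma_{p \times p})$. The minimax risk for 
estimating the covariance matrix $\Sigma$ over the parameter space $\mathcal{G}_q(c_{n,p})$ with 
$c_{n,p} \le M n^{(1-q)/2} (\log p)^{-(3-q)/2}$ satisfies 

(17)  $\inf_{\hat\Sigma} \sup_{\Sigma \in \mathcal{G}_q(c_{n,p})} \mathbb{E}_{X|\Sigma} |||\hat\Sigma - \Sigma|||^2 \ge c \left[ c_{n,p}^2 \left( \frac{\log p}{n} \right)^{1-q} + \frac{\log p}{n} \right]$ 

for some constant $c > 0$, where $|||\cdot|||$ denotes the matrix spectral norm. 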



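Theorem 2 yields immediately a minimax lower bound for the more 
general subgaussian case under assumption (2), 

     $\inf_{\hat\Sigma} \sup_{\theta \in \mathcal{P}_q(\tau, c_{n,p})} \mathbb{E}_{X|\theta} |||\hat\Sigma - \Sigma|||^2 \ge c \left[ c_{n,p}^2 \left( \frac{\log p}{n} \right)^{1-q} + \frac{\log p}{n} \right]$. 

It has been shown in Cai, Zhang and Zhou (2010) that 

     $\inf_{\hat\Sigma} \sup_{\theta \in \mathcal{P}_q(\tau, c_{n,p})} \mathbb{E}_{X|\theta} |||\hat\Sigma - \Sigma|||^2 \ge c \frac{\log p}{n}$ 

by constructing a parameter space with only diagonal matrices. It then 
suffices to show that 

     $\inf_{\hat\Sigma} \sup_{\theta \in \mathcal{P}_q(\tau, c_{n,p})} \mathbb{E}_{X|\theta} |||\hat\Sigma - \Sigma|||^2 \ge c \, c_{n,p}^2 \left( \frac{\log p}{n} \right)^{1-q}$ 

to establish Theorem 2. 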

The proof of Theorem 2 contains three major steps. In the first step we 
construct in detail a finite subset $\mathcal{F}_*$ of the parameter space $\mathcal{G}_q(c_{n,p})$ such 
that the difficulty of estimation over $\mathcal{F}_*$ is essentially the same as that of 
estimation over $\mathcal{G}_q(c_{n,p})$. The second step is the application of Lemma 3 to the 
carefully constructed parameter set $\mathcal{F}_*$. Finally in the third step we 
calculate the factor $\alpha$ defined in (11) and the total variation affinity between two 
multivariate normal mixtures. Bounding the affinity is technically involved. 
The main ideas of the proof are outlined here, and detailed proofs of some 
technical lemmas used here are deferred to Section 7. 



Proof of Theorem 2. The proof is divided into three main steps. 

Step 1: Constructing the parameter set. Let $r = \lfloor p/2 \rfloor$, where $\lfloor x \rfloor$ denotes 
the largest integer less than or equal to $x$, and let $B$ be the collection of 
all row vectors $b = (v_j)_{1 \le j \le p}$ such that $v_j = 0$ for $1 \le j \le p - r$ and $v_j = 0$ 
or 1 for $p - r + 1 \le j \le p$ under the constraint that the total number of 1s is 
$\|b\|_0 = k$, where the value of $k$ will be specified later. We shall treat each 
$(b_1, \ldots, b_r) \in B^r$ as an $r \times p$ matrix with the ith row equal to $b_i$. 

Set $\Gamma = \{0,1\}^r$. Define $\Lambda \subseteq B^r$ to be the set of all elements in $B^r$ such 
that each column sum is less than or equal to $2k$. For each component $\lambda_m$, 
$1 \le m \le r$, of $\lambda = (\lambda_1, \ldots, \lambda_r) \in \Lambda$, define a $p \times p$ symmetric matrix $A_m(\lambda_m)$ 
by making the mth row of $A_m(\lambda_m)$ equal to $\lambda_m$, the mth column equal to 
$\lambda_m^T$ and the rest of the entries 0. Note that for each $\lambda = (\lambda_1, \ldots, \lambda_r) \in \Lambda$, 
each column/row sum of the matrix $\sum_{m=1}^r A_m(\lambda_m)$ is less than or equal to 
$2k$. 

Define 

(18)  $\Theta = \Gamma \otimes \Lambda$, 

and let $\epsilon_{n,p} \in \mathbb{R}$ be fixed. (The exact value of $\epsilon_{n,p}$ will be chosen later.) For 
each $\theta = (\gamma, \lambda) \in \Theta$ with $\gamma = (\gamma_1, \ldots, \gamma_r) \in \Gamma$ and $\lambda = (\lambda_1, \ldots, \lambda_r) \in \Lambda$, we 
associate $\theta$ with a covariance matrix $\Sigma(\theta)$ by 

(19)  $\Sigma(\theta) = I_p + \epsilon_{n,p} \sum_{m=1}^r \gamma_m A_m(\lambda_m)$. 

It is easy to see that in the Gaussian case $|||\Sigma_{p \times p}||| \le \tau$ is a sufficient condition 
for (2). Without loss of generality we assume that $\tau > 1$ in the subgaussianity 
assumption (2); otherwise we replace $I_p$ in (19) by $cI_p$ with a small constant 
$c > 0$. Finally we define a collection $\mathcal{F}_*$ of covariance matrices as 

(20)  $\mathcal{F}_* = \{\Sigma(\theta) : \Sigma(\theta) = I_p + \epsilon_{n,p} \sum_{m=1}^r \gamma_m A_m(\lambda_m), \theta = (\gamma, \lambda) \in \Theta\}$. 

Note that each $\Sigma \in \mathcal{F}_*$ has value 1 along the main diagonal, and contains an 
$r \times r$ submatrix, say, $A$, at the upper right corner, $A^T$ at the lower left corner 
and 0 elsewhere. Each row of $A$ is either identically 0 (if the corresponding 
$\gamma$ value is 0) or has exactly $k$ nonzero elements with value $\epsilon_{n,p}$. 
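
The construction in (19) can be written out in a few lines. The following Python sketch builds the matrix for a given pair of indicator and row vectors; the function names `a_matrix` and `sigma_theta`, the 0-based indexing, and the toy values p = 4, r = 2, k = 1 and perturbation size 0.1 are assumptions made for illustration only.

```python
def a_matrix(m, lam_m, p):
    """A_m(lam_m): p x p symmetric matrix whose m-th row and m-th column
    equal lam_m and whose other entries are 0 (m is 0-based here)."""
    a = [[0.0] * p for _ in range(p)]
    for j in range(p):
        a[m][j] = lam_m[j]
        a[j][m] = lam_m[j]
    return a


def sigma_theta(gamma, lam, eps):
    """Sigma(theta) = I_p + eps * sum_m gamma_m * A_m(lam_m), as in (19)."""
    p = len(lam[0])
    sigma = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    for m, (g, lam_m) in enumerate(zip(gamma, lam)):
        if g:
            am = a_matrix(m, lam_m, p)
            for i in range(p):
                for j in range(p):
                    sigma[i][j] += eps * am[i][j]
    return sigma


# Toy example: p = 4, r = 2, k = 1. Rows of lam vanish on the first
# p - r coordinates and carry k ones among the last r coordinates.
lam = ((0, 0, 1, 0), (0, 0, 0, 1))
S = sigma_theta((1, 0), lam, 0.1)
```

In the toy example only the first row is switched on, so `S` is the identity plus the value 0.1 in positions (1, 3) and (3, 1) in 1-based indexing, matching the block description above.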

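We now specify the values of $\epsilon_{n,p}$ and $k$ to ensure $\mathcal{F}_* \subset \mathcal{G}_q(c_{n,p})$. Set 
$\epsilon_{n,p} = \upsilon \sqrt{\log p / n}$ for a fixed small constant $\upsilon$, and let 
$k = \max(\lceil \frac{1}{2} c_{n,p} \epsilon_{n,p}^{-q} \rceil - 1, 0)$, which implies 

     $\max_{1 \le j \le p} \sum_{i \ne j} |\sigma_{ij}|^q \le 2k \epsilon_{n,p}^q \le c_{n,p}$. 

We require 

(21)  $0 < \upsilon < \left[ \min\left\{ \frac{1}{3}, \tau - 1 \right\} \frac{1}{M} \right]^{1/(1-q)}$ and $\upsilon^2 < \frac{\beta - 1}{54\beta}$. 

Note that $\epsilon_{n,p}$ and $k$ satisfy 

(22)  $2k\epsilon_{n,p} \le c_{n,p} \epsilon_{n,p}^{1-q} \le M \upsilon^{1-q} < \min\{\frac{1}{3}, \tau - 1\}$ 

and consequently every $\Sigma(\theta)$ is diagonally dominant and positive definite, 
and $|||\Sigma(\theta)||| \le |||\Sigma(\theta)|||_1 \le 2k\epsilon_{n,p} + 1 < \tau$. Thus we have $\mathcal{F}_* \subset \mathcal{G}_q(c_{n,p})$, and 
the subgaussianity assumption (2) is satisfied. 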
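
The choice of the perturbation size and of the number k of nonzero entries per row can be checked numerically. The sketch below is illustrative only: the function name `construction_parameters` is hypothetical, and the sample size, dimension, sparsity level and constant passed in the example are arbitrary values, not values taken from the paper.

```python
import math


def construction_parameters(n, p, q, c_np, upsilon):
    """Return (eps, k) with eps = upsilon * sqrt(log p / n) and
    k = max(ceil(c_np * eps**(-q) / 2) - 1, 0)."""
    eps = upsilon * math.sqrt(math.log(p) / n)
    k = max(math.ceil(0.5 * c_np * eps ** (-q)) - 1, 0)
    return eps, k


# Arbitrary illustrative values (assumptions, not from the paper).
n, p, q, c_np, upsilon = 200, 500, 0.5, 3.0, 0.1
eps, k = construction_parameters(n, p, q, c_np, upsilon)

column_budget = 2 * k * eps ** q   # should not exceed c_np
row_sum = 2 * k * eps              # below 1 means diagonal dominance
```

Since the ceiling of a number minus one is strictly smaller than that number, `column_budget` never exceeds `c_np`, which is the sparsity constraint displayed above; `row_sum` below 1 corresponds to the diagonal dominance used in (22).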

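Step 2: Applying the general lower bound argument. Let $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} 
N(0, \Sigma(\theta))$ with $\theta \in \Theta$ and denote the joint distribution by $\mathbb{P}_\theta$. Applying 
Lemma 3 to the parameter space $\Theta$ with $s = 2$, we have 

(23)  $\inf_{\hat\Sigma} \max_{\theta \in \Theta} 2^2 \mathbb{E}_{X|\theta} |||\hat\Sigma - \Sigma(\theta)|||^2 \ge \alpha \cdot \frac{r}{2} \cdot \min_{1 \le i \le r} \|\bar{\mathbb{P}}_{i,0} \wedge \bar{\mathbb{P}}_{i,1}\|$, 

where 

(24)  $\alpha = \min_{\{(\theta,\theta') : H(\gamma(\theta), \gamma(\theta')) \ge 1\}} \frac{|||\Sigma(\theta) - \Sigma(\theta')|||^2}{H(\gamma(\theta), \gamma(\theta'))}$ 

and $\bar{\mathbb{P}}_{i,0}$ and $\bar{\mathbb{P}}_{i,1}$ are defined as in (9). 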

Step 3: Bounding the affinity and per comparison loss. We shall now bound 
the two factors $\alpha$ and $\min_i \|\bar{\mathbb{P}}_{i,0} \wedge \bar{\mathbb{P}}_{i,1}\|$ in (23). This is done separately in 
the next two lemmas which are proved in detail in Section 7. Lemma 5 gives 
a lower bound to the per comparison loss, and it is easy to prove. 

Lemma 5. For $\alpha$ defined in equation (24) we have 

     $\alpha \ge \frac{(k\epsilon_{n,p})^2}{p}$. 

The key technical difficulty is in bounding the affinity between the 
Gaussian mixtures. The proof is quite involved. 

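Lemma 6. Let $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} N(0, \Sigma(\theta))$ with $\theta \in \Theta$ defined in 
equation (18), and denote the joint distribution by $\mathbb{P}_\theta$. For $a \in \{0,1\}$ and $1 \le i \le r$, 
define $\bar{\mathbb{P}}_{i,a}$ as in (9). Then there exists a constant $c_1 > 0$ such that 

     $\min_{1 \le i \le r} \|\bar{\mathbb{P}}_{i,0} \wedge \bar{\mathbb{P}}_{i,1}\| \ge c_1$. 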

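Finally, the minimax lower bound for estimation over $\mathcal{G}_q(c_{n,p})$ is obtained 
by putting together the bounds given in Lemmas 5 and 6, 

     $\inf_{\hat\Sigma} \sup_{\Sigma \in \mathcal{G}_q(c_{n,p})} \mathbb{E}_{X|\Sigma} |||\hat\Sigma - \Sigma|||^2 \ge \inf_{\hat\Sigma} \max_{\Sigma(\theta) \in \mathcal{F}_*} \mathbb{E}_{X|\theta} |||\hat\Sigma - \Sigma(\theta)|||^2$ 

          $\ge \frac{(k\epsilon_{n,p})^2}{p} \cdot \frac{r}{8} \cdot c_1$ 

          $\ge c_2 \, c_{n,p}^2 \left( \frac{\log p}{n} \right)^{1-q}$ 

for some constant $c_2 > 0$. □ 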



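Remark 2. It is easy to check that the proof of Theorem 2 also yields a 
lower bound for estimation under the general matrix $\ell_w$ operator norm for 
any $1 \le w \le \infty$, 

     $\inf_{\hat\Sigma} \sup_{\theta \in \mathcal{P}_q(\tau, c_{n,p})} \mathbb{E}_{X|\theta} |||\hat\Sigma - \Sigma|||_w^2 \ge c \left[ c_{n,p}^2 \left( \frac{\log p}{n} \right)^{1-q} + \frac{\log p}{n} \right]$ 

by applying Lemma 3 with $s = 1$. 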
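
For readers who want to experiment with the operator norms mentioned in Remark 2, the cases w = 1 and w = ∞ have well-known closed forms (largest absolute column sum and largest absolute row sum), and the spectral norm of a symmetric matrix can be approximated by power iteration. The Python sketch below is a rough illustration under these assumptions; the function names, the fixed iteration count and the starting vector are choices made here, not part of the paper.

```python
import math


def op_norm_l1(a):
    """Matrix l_1 operator norm: the largest absolute column sum."""
    return max(sum(abs(row[j]) for row in a) for j in range(len(a[0])))


def op_norm_linf(a):
    """Matrix l_infinity operator norm: the largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in a)


def spectral_norm_sym(a, iters=500):
    """Approximate spectral norm of a symmetric matrix by power iteration
    on A^2, which converges to the largest |eigenvalue| of A."""
    n = len(a)

    def matvec(v):
        return [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]

    v = [1.0 + i / n for i in range(n)]   # generic starting vector
    norm = 0.0
    for _ in range(iters):
        w = matvec(matvec(v))             # one power step on A^2
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    return math.sqrt(norm)


A = [[2.0, 1.0], [1.0, 2.0]]
l1, linf, spec = op_norm_l1(A), op_norm_linf(A), spectral_norm_sym(A)
```

For this symmetric example all three norms equal 3 up to rounding, consistent with the inequality between the spectral norm and the $\ell_1$ operator norm of a symmetric matrix used in Step 1 of the proof.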

4. Minimax upper bound under the spectral norm. Section 3 developed 
a minimax lower bound for estimating a sparse covariance matrix under the 
spectral norm over $\mathcal{G}_q(c_{n,p})$. In this section we shall show that the lower 
bound is rate-sharp and therefore establish the optimal rate of convergence. 
To derive a minimax upper bound, we shall consider the properties of a 
thresholding estimator introduced in Bickel and Levina (2008b). Given a 
random sample $\{X_1, \ldots, X_n\}$ of p-variate observations drawn from a 
distribution in $\mathcal{P}_q(\tau, c_{n,p})$, the sample covariance matrix is 

     $\frac{1}{n-1} \sum_{l=1}^n (X_l - \bar{X})(X_l - \bar{X})^T$, 

which is an unbiased estimate of $\Sigma$, and the maximum likelihood estimator 
of $\Sigma$ is 

(25)  $\Sigma^* = (\sigma^*_{ij})_{1 \le i,j \le p} = \frac{1}{n} \sum_{l=1}^n (X_l - \bar{X})(X_l - \bar{X})^T$ 

when the $X_l$'s are normally distributed. These two estimators are close to each 
other for large $n$. We shall construct estimators of the covariance matrix $\Sigma$ 
by thresholding the maximum likelihood estimator $\Sigma^*$. 
Note that the subgaussianity condition (2) implies 

     $|||\Sigma||| = \sup_{v : \|v\| = 1} \mathrm{Var}[v^T(X_1 - EX_1)] \le \int_0^\infty e^{-x/(2\tau)} \, dx = 2\tau$. 



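Then the empirical covariance $\sigma^*_{ij}$ satisfies the following large deviation 
result that there exist constants $C_1 > 0$ and $\gamma > 0$ such that 

(26)  $P(|\sigma^*_{ij} - \sigma_{ij}| > t) \le C_1 \exp\left( -\frac{8}{\gamma^2} n t^2 \right)$ 

for $|t| \le \delta$, where $C_1$, $\gamma$ and $\delta$ are constants and depend only on $\tau$. See 
Saulis and Statulevicius (1991) and Bickel and Levina (2008a). Inequality 
(26) implies $\sigma^*_{ij}$ behaves like a subgaussian random variable. In particular 
for $t = \gamma \sqrt{\log p / n}$ we have 

(27)  $P(|\sigma^*_{ij} - \sigma_{ij}| > t) \le C_1 p^{-8}$. 

Define the thresholding estimator $\hat\Sigma = (\hat\sigma_{ij})_{p \times p}$ by 

(28)  $\hat\sigma_{ij} = \sigma^*_{ij} \cdot I\left( |\sigma^*_{ij}| \ge \gamma \sqrt{\frac{\log p}{n}} \right)$. 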



This thresholding estimator was first proposed in Bickel and Levina (2008b) 
in which a rate of convergence of the loss function in probability was given 
over the uniformity class $\mathcal{G}_q^*(c_{n,p})$. Here we provide an upper bound for mean 
squared spectral norm error over the parameter space $\mathcal{G}_q(c_{n,p})$. 
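
A minimal Python sketch of the entrywise thresholding rule in (28) follows. It is an illustration under stated assumptions: the function names are hypothetical, the data are a tiny made-up sample, and the constant `gamma` is passed in as a free parameter because its value in the paper depends on the subgaussian constant and is not specified numerically here.

```python
import math


def mle_covariance(xs):
    """Sigma* in (25): (1/n) * sum_l (X_l - Xbar)(X_l - Xbar)^T."""
    n, p = len(xs), len(xs[0])
    mean = [sum(x[j] for x in xs) / n for j in range(p)]
    return [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in xs) / n
             for j in range(p)] for i in range(p)]


def threshold_estimator(xs, gamma):
    """Entrywise hard thresholding (28): keep sigma*_ij only when
    |sigma*_ij| >= gamma * sqrt(log p / n), otherwise set it to 0."""
    n, p = len(xs), len(xs[0])
    level = gamma * math.sqrt(math.log(p) / n)
    s = mle_covariance(xs)
    return [[v if abs(v) >= level else 0.0 for v in row] for row in s]


# Made-up sample with n = 4 observations of a p = 3 dimensional vector.
xs = [[1.0, 1.0, 1.2], [-1.0, -1.0, 1.0], [1.0, 1.0, -1.0], [-1.0, -1.0, -1.2]]
est = threshold_estimator(xs, gamma=1.0)
```

In this toy sample the first two coordinates are perfectly correlated and the third is only weakly related to them, so the small empirical covariance between the first and third coordinates falls below the threshold and is set to zero while the large entries are kept.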

Throughout the rest of the paper we denote by C a generic positive con- 
stant which may vary from place to place. The following theorem shows that 
the thresholding estimator defined in (28) is rate optimal over the parameter 
space $\mathcal{G}_q(c_{n,p})$. 

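Theorem 3. The thresholding estimator $\hat\Sigma$ given in (28) satisfies, for 
some constant $C > 0$, 

(29)  $\sup_{\theta \in \mathcal{P}_q(\tau, c_{n,p})} \mathbb{E}_{X|\theta} |||\hat\Sigma - \Sigma|||^2 \le C \left[ c_{n,p}^2 \left( \frac{\log p}{n} \right)^{1-q} + \frac{\log p}{n} \right]$. 

Consequently, the minimax risk of estimating the sparse covariance matrix 
$\Sigma$ over $\mathcal{G}_q(c_{n,p})$ satisfies 

(30)  $\inf_{\hat\Sigma} \sup_{\theta \in \mathcal{P}_q(\tau, c_{n,p})} \mathbb{E}_{X|\theta} |||\hat\Sigma - \Sigma|||^2 \asymp c_{n,p}^2 \left( \frac{\log p}{n} \right)^{1-q} + \frac{\log p}{n}$. 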

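Remark 3. A similar argument to the proof of equation (29) in Section 7.4
yields the following upper bound for estimation under the matrix $\ell_1$
norm:

    $\sup_{\theta \in \mathcal{P}_q(\tau, c_{n,p})} E_{X|\theta} |||\hat\Sigma - \Sigma|||_1^2 \le C \left[ c_{n,p}^2 \left( \frac{\log p}{n} \right)^{1-q} + \frac{\log p}{n} \right].$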

Theorem 3 shows that the optimal rate of convergence for estimating a
sparse covariance matrix over $\mathcal{G}_q(c_{n,p})$ under the squared
spectral norm is $c_{n,p}^2 (\log p / n)^{1-q} + \log p / n$. In Bickel and
Levina (2008b) the uniformity class $\mathcal{G}^*_q(c_{n,p})$ defined in (16)
was considered. We shall now show that the same minimax rate of convergence
holds for estimation over $\mathcal{G}^*_q(c_{n,p})$. It is easy to check in
the proof of the lower bound that for every $\Sigma \in \mathcal{F}_*$, defined
in (20), we have

    $\max_{1 \le j \le p} \sum_{i \ne j} |\sigma_{ij}|^q \le 2k \epsilon_{n,p}^q \le c_{n,p}$

and consequently $\mathcal{F}_* \subset \mathcal{G}^*_q(c_{n,p})$. Thus the lower
bound established for $\mathcal{F}_*$ automatically yields a lower bound for
$\mathcal{G}^*_q(c_{n,p})$. On the other hand, since a strong $\ell_q$ ball is
always contained in a weak $\ell_q$ ball by the Markov inequality, the upper
bound in equation (29) for the parameter space $\mathcal{G}_q(c_{n,p})$ also
holds for $\mathcal{G}^*_q(c_{n,p})$. Let $\mathcal{P}^*_q(\tau, c_{n,p})$ denote
the set of distributions of $X_1$ satisfying (2) and with covariance matrix
$\Sigma \in \mathcal{G}^*_q(c_{n,p})$. Then we have the following result.

Proposition 1. The minimax risk for estimating the covariance matrix
under the spectral norm over the uniformity class $\mathcal{G}^*_q(c_{n,p})$
satisfies

    $\inf_{\hat\Sigma} \sup_{\theta \in \mathcal{P}^*_q(\tau, c_{n,p})} E_{X|\theta} |||\hat\Sigma - \Sigma|||^2 \asymp c_{n,p}^2 \left( \frac{\log p}{n} \right)^{1-q} + \frac{\log p}{n}.$

The thresholding estimator $\hat\Sigma$ defined by (28) is positive definite
with high probability, but it is not guaranteed to be positive definite. A
simple additional step can make the final estimator positive semi-definite
and achieve the optimal rate of convergence. Write the eigen-decomposition
of $\hat\Sigma$ as

    $\hat\Sigma = \sum_{i=1}^{p} \hat\lambda_i v_i v_i^T,$

where the $\hat\lambda_i$'s and $v_i$'s are the eigenvalues and eigenvectors of
$\hat\Sigma$, respectively. Let $\hat\lambda_i^+ = \max(\hat\lambda_i, 0)$ be the
positive part of $\hat\lambda_i$ and define

    $\hat\Sigma^+ = \sum_{i=1}^{p} \hat\lambda_i^+ v_i v_i^T.$

Then

    $|||\hat\Sigma^+ - \Sigma||| \le |||\hat\Sigma^+ - \hat\Sigma||| + |||\hat\Sigma - \Sigma||| \le \max_{i: \hat\lambda_i < 0} |\hat\lambda_i| + |||\hat\Sigma - \Sigma|||$
    $\le \max_{i: \hat\lambda_i < 0} |\hat\lambda_i - \lambda_i| + |||\hat\Sigma - \Sigma||| \le 2 |||\hat\Sigma - \Sigma|||.$

The resulting estimator $\hat\Sigma^+$ is positive semi-definite and attains
the same rate as the original thresholding estimator $\hat\Sigma$. This method
can also be applied to the tapering estimator in Cai, Zhang and Zhou (2010)
to make the estimator positive semi-definite while still achieving the
optimal rate.
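The two steps described above, entrywise hard thresholding of the maximum likelihood estimator followed by truncation of negative eigenvalues, can be sketched numerically. This is a minimal illustration rather than the authors' implementation: the simulated covariance, the value `gamma = 2.0`, and all function names are assumptions chosen for the example.

```python
import numpy as np

def mle_covariance(X):
    """Maximum likelihood covariance estimate (divides by n, not n - 1)."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / X.shape[0]

def hard_threshold(S, gamma, n):
    """Keep entries with magnitude >= gamma * sqrt(log p / n); zero the rest."""
    p = S.shape[0]
    level = gamma * np.sqrt(np.log(p) / n)
    return np.where(np.abs(S) >= level, S, 0.0)

def positive_part(S):
    """Replace the negative eigenvalues of a symmetric matrix by zero."""
    lam, V = np.linalg.eigh(S)
    return (V * np.maximum(lam, 0.0)) @ V.T

def spectral_norm(A):
    return np.linalg.norm(A, 2)

rng = np.random.default_rng(0)
n, p = 200, 30
Sigma = np.eye(p)
Sigma[0, 1] = Sigma[1, 0] = 0.4          # hypothetical sparse truth
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

Sigma_hat = hard_threshold(mle_covariance(X), gamma=2.0, n=n)  # gamma is illustrative
Sigma_plus = positive_part(Sigma_hat)

err_hat = spectral_norm(Sigma_hat - Sigma)
err_plus = spectral_norm(Sigma_plus - Sigma)
print(err_hat, err_plus)
```

Because the true covariance is positive semi-definite, the displayed chain of inequalities guarantees `err_plus <= 2 * err_hat` for any data set, which the sketch can be used to confirm.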

5. Optimal estimation under Bregman divergences. We have so far fo- 
cused on the optimal rate of convergence under the spectral norm. In this 
section we turn to minimax estimation of sparse covariance matrices under a 
class of Bregman divergence losses which include Stein's loss, Frobenius norm
and von Neumann's entropy as special cases. Bregman matrix divergences 
have been used for matrix estimation and matrix approximation problems; 
see, for example, Dhillon and Tropp (2007), Ravikumar et al. (2008) and 
Kulis, Sustik and Dhillon (2009). In this section we establish the optimal 
rate of convergence uniformly for a class of Bregman divergence losses. 

Bregman (1967) introduced the Bregman divergence as a dissimilarity
measure between vectors,

    $D_\phi(x, y) = \phi(x) - \phi(y) - (\nabla\phi(y))^T (x - y),$

where $\phi$ is a differentiable, real-valued and strictly convex function
defined over a convex set in a Euclidean space $\mathbb{R}^m$, and
$\nabla\phi$ is the gradient of $\phi$. The well-known Mahalanobis distance is
a Bregman divergence. This concept can be naturally extended to the space
of real and symmetric matrices as

    $D_\phi(X, Y) = \phi(X) - \phi(Y) - \mathrm{tr}[(\nabla\phi(Y))^T (X - Y)],$

where X and Y are real symmetric matrices, and $\phi$ is a differentiable
strictly convex function over the space. See Censor and Zenios (1997) and
Kulis, Sustik and Dhillon (2009). A particularly interesting class of $\phi$ is

(31)  $\phi(X) = \sum_{i=1}^{p} \varphi(\lambda_i),$

where the $\lambda_i$'s are the eigenvalues of X, and $\varphi$ is a
differentiable, real-valued and strictly convex function over a convex set
in $\mathbb{R}$. See Dhillon and Tropp (2007) and Kulis, Sustik and Dhillon
(2009). Examples of this class of Bregman divergences include:

• $\varphi(\lambda) = -\log\lambda$, or equivalently $\phi(X) = -\log\det(X)$. The
corresponding Bregman divergence can be written as

    $D_\phi(X, Y) = \mathrm{tr}(XY^{-1}) - \log\det(XY^{-1}) - p,$

which is often called Stein's loss in the statistical literature.

• $\varphi(\lambda) = \lambda\log\lambda - \lambda$, or equivalently
$\phi(X) = \mathrm{tr}(X\log X - X)$, where X is positive definite such that
$\log X$ is well defined. The corresponding Bregman divergence is the von
Neumann divergence

    $D_\phi(X, Y) = \mathrm{tr}(X\log X - X\log Y - X + Y).$

• $\varphi(\lambda) = \lambda^2$, or equivalently $\phi(X) = \mathrm{tr}(X^2)$. The
resulting Bregman divergence is the squared Frobenius norm

    $D_\phi(X, Y) = \mathrm{tr}[(X - Y)^2] = |||X - Y|||_F^2 = \sum_{i,j} (x_{ij} - y_{ij})^2$

for $X = (x_{ij})_{1 \le i,j \le p}$ and $Y = (y_{ij})_{1 \le i,j \le p}$.
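The three divergences listed above can be evaluated directly from their closed forms. The following sketch is only an illustration: the helper names and the randomly generated positive definite test matrices are assumptions, and the matrix logarithm is computed through an eigendecomposition, which is valid for symmetric positive definite inputs.

```python
import numpy as np

def sym_log(A):
    """Matrix logarithm of a symmetric positive definite matrix."""
    lam, V = np.linalg.eigh(A)
    return (V * np.log(lam)) @ V.T

def stein_loss(X, Y):
    """tr(X Y^{-1}) - log det(X Y^{-1}) - p."""
    M = X @ np.linalg.inv(Y)
    _, logdet = np.linalg.slogdet(M)
    return float(np.trace(M) - logdet - X.shape[0])

def von_neumann(X, Y):
    """tr(X log X - X log Y - X + Y)."""
    return float(np.trace(X @ sym_log(X) - X @ sym_log(Y) - X + Y))

def squared_frobenius(X, Y):
    """tr[(X - Y)^2], which equals the sum of squared entry differences."""
    return float(np.trace((X - Y) @ (X - Y)))

def random_spd(p, rng):
    """Hypothetical test matrix: a well-conditioned positive definite matrix."""
    A = rng.standard_normal((p, p))
    return A @ A.T / p + np.eye(p)

rng = np.random.default_rng(0)
X, Y = random_spd(5, rng), random_spd(5, rng)
for f in (stein_loss, von_neumann, squared_frobenius):
    print(f.__name__, f(X, Y), f(X, X))
```

Each divergence vanishes when both arguments coincide and is strictly positive otherwise, as expected for a Bregman divergence generated by a strictly convex function.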

Define a class $\Psi$ of functions $\varphi$ satisfying the following conditions:

(1) $\varphi$ is twice differentiable, real-valued and strictly convex over
$\lambda \in (0, \infty)$;

(2) $|\varphi(\lambda)| \le C\lambda^r$ for some C > 0 and some real number r
uniformly over $\lambda \in (0, \infty)$;

(3) For every pair of positive constants $\epsilon_2$ and $M_2$ there are some
positive constants $c_L$ and $c_u$ depending on $\epsilon_2$ and $M_2$ such that
$c_L \le \varphi''(\lambda) \le c_u$ for all $\lambda \in [\epsilon_2, M_2]$.

In this paper, we shall consider the following class of Bregman divergences:

(32)  $\Phi = \left\{ \phi(\Sigma) = \sum_{i=1}^{p} \varphi(\lambda_i) : \varphi \in \Psi \right\}.$

It is easy to see that Stein's loss, von Neumann's divergence and the squared
Frobenius norm are in this class.

Let $\epsilon_1 > 0$ be a positive constant. Let $\mathcal{P}^B_q(\tau, c_{n,p})$
denote the set of distributions of $X_1$ satisfying (2) and with covariance
matrix

    $\Sigma \in \mathcal{G}^B_q(c_{n,p}) = \mathcal{G}_q(c_{n,p}) \cap \{\Sigma : \lambda_{\min} \ge \epsilon_1\}.$

Here $\lambda_{\min}$ denotes the minimum eigenvalue of $\Sigma$. The assumption
that all eigenvalues are bounded away from 0 is necessary when
$\varphi(\lambda)$ is not well defined at 0. An example is the Stein loss, where
$\varphi(\lambda) = -\log\lambda$. Under this assumption all losses are
equivalent to the squared Frobenius norm.

The following theorem gives a unified result on the minimax rate of
convergence for estimating the covariance matrix over the parameter space
$\mathcal{P}^B_q(\tau, c_{n,p})$ for all Bregman divergences $\phi \in \Phi$
defined in (32).



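Theorem 4. Assume that $c_{n,p} \le M n^{(1-q)/2} (\log p)^{-(3-q)/2}$ for some
M > 0 and $0 \le q < 1$. The minimax risk over $\mathcal{P}^B_q(\tau, c_{n,p})$
under the loss function

    $L_\phi(\hat\Sigma, \Sigma) = \frac{1}{p} D_\phi(\hat\Sigma, \Sigma)$

for all Bregman divergences $\phi \in \Phi$ defined in (32) satisfies

(33)  $\inf_{\hat\Sigma} \sup_{\theta \in \mathcal{P}^B_q(\tau, c_{n,p})} E_{X|\theta} L_\phi(\hat\Sigma, \Sigma) \asymp c_{n,p} \left( \frac{\log p}{n} \right)^{1-q/2} + \frac{1}{n}.$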

Note that Theorem 4 gives the minimax rate of convergence uniformly
under all Bregman divergences defined in (32). For an individual Bregman
divergence loss, the condition that all eigenvalues are bounded away from 0
is not needed if the function $\varphi$ is well behaved at 0. For example,
such is the case for the Frobenius norm.

The optimal rate of convergence is attained by a modified thresholding
estimator. Let $\hat\Sigma = (\hat\sigma_{ij})_{1 \le i,j \le p}$ be the
thresholding estimator given in (28). Define the final estimator of $\Sigma$ by

(34)  $\hat\Sigma_B = \begin{cases} \hat\Sigma, & \text{if } \dfrac{1}{\max\{\log n, \log p\}} \le \lambda_{\min}(\hat\Sigma) \le \max\{\log n, \log p\}, \\ I, & \text{otherwise.} \end{cases}$

It will be proved in Section 7.5 that the estimator $\hat\Sigma_B$ given in (34)
is rate optimal uniformly under all Bregman divergences satisfying (32). Note
that the modification of $\hat\Sigma$ given in (34) is needed. Without it, the
loss $L_\phi(\hat\Sigma, \Sigma)$ may not be well behaved under some Bregman
divergences such as Stein's loss and von Neumann's divergence.
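The safeguard in (34) can be sketched in a few lines. The sketch assumes the rule reads as follows: keep the thresholded matrix when its smallest eigenvalue lies between $1/\max\{\log n, \log p\}$ and $\max\{\log n, \log p\}$, and fall back to the identity otherwise. The function name and the example matrices are hypothetical.

```python
import numpy as np

def bregman_safe_estimator(Sigma_hat, n):
    """Return Sigma_hat if its smallest eigenvalue is in the allowed range,
    and the identity matrix otherwise (assumed reading of the rule in (34))."""
    p = Sigma_hat.shape[0]
    bound = max(np.log(n), np.log(p))
    lam_min = np.linalg.eigvalsh(Sigma_hat).min()
    if 1.0 / bound <= lam_min <= bound:
        return Sigma_hat
    return np.eye(p)

well_behaved = np.array([[1.0, 0.2], [0.2, 1.0]])   # eigenvalues 0.8 and 1.2
indefinite = np.array([[1.0, 2.0], [2.0, 1.0]])     # eigenvalues -1 and 3

kept = bregman_safe_estimator(well_behaved, n=100)
replaced = bregman_safe_estimator(indefinite, n=100)
print(kept)
print(replaced)
```

The fallback keeps losses such as Stein's loss finite, since $\log\det$ is undefined for a matrix with a nonpositive eigenvalue.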

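Remark 4. Let $\mathcal{P}^{B*}_q(\tau, c_{n,p})$ denote the set of distributions
of $X_1$ satisfying (2) and with covariance matrix
$\Sigma \in \mathcal{G}^{B*}_q(c_{n,p}) = \mathcal{G}^*_q(c_{n,p}) \cap \{\Sigma : \lambda_{\min} \ge \epsilon_1\}$.
Then under the same conditions as in Theorem 4,

    $\inf_{\hat\Sigma} \sup_{\theta \in \mathcal{P}^{B*}_q(\tau, c_{n,p})} E_{X|\theta} L_\phi(\hat\Sigma, \Sigma) \asymp c_{n,p} \left( \frac{\log p}{n} \right)^{1-q/2} + \frac{1}{n}.$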

6. Discussions. The focus of this paper is mainly on the optimal estimation
under the spectral norm. However, both the lower and upper bounds can be
easily extended to the general matrix $\ell_w$ norm for $1 \le w \le \infty$ by
using arguments similar to those given in Sections 3 and 4.

Theorem 5. Under the assumptions in Theorem 1, the minimax risk of
estimating the covariance matrix $\Sigma$ under the matrix $\ell_w$-norm for
$1 \le w \le \infty$ over the class $\mathcal{P}_q(\tau, c_{n,p})$ satisfies

(35)  $\inf_{\hat\Sigma} \sup_{\theta \in \mathcal{P}_q(\tau, c_{n,p})} E_{X|\theta} |||\hat\Sigma - \Sigma|||_w^2 \asymp c_{n,p}^2 \left( \frac{\log p}{n} \right)^{1-q} + \frac{\log p}{n}.$

Moreover, the thresholding estimator $\hat\Sigma$ defined in (28) is rate-optimal.

As noted in Section 3, a rate-sharp lower bound for the minimax risk
under the $\ell_w$ norm can be obtained by using essentially the same argument
with the same parameter space $\mathcal{F}_*$ and a slightly modified version of
Lemma 5. The upper bound can be proved by applying the Riesz-Thorin
interpolation theorem, which yields
$|||A|||_w \le \max\{|||A|||_1, |||A|||_2, |||A|||_\infty\}$ for all
$w \in [1, \infty]$, and by using the facts $|||A|||_1 = |||A|||_\infty$ and
$|||A|||_2 \le |||A|||_1$ when A is symmetric. In Section 4 we have in fact
established the same risk bound for both the spectral norm and matrix
$\ell_1$-norm.
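The two norm facts used for symmetric matrices, namely that the matrix $\ell_1$ and $\ell_\infty$ norms coincide and that the spectral norm does not exceed the $\ell_1$ norm, are easy to check numerically. The random symmetric test matrix below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((8, 8))
A = (B + B.T) / 2                        # symmetric test matrix

norm_1 = np.linalg.norm(A, 1)            # maximum absolute column sum
norm_inf = np.linalg.norm(A, np.inf)     # maximum absolute row sum
norm_2 = np.linalg.norm(A, 2)            # largest singular value

print(norm_1, norm_inf, norm_2)
```

Combined with the interpolation inequality, these facts reduce an $\ell_w$ bound for a symmetric matrix to a bound on its $\ell_1$ norm.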

The spectral norm of a matrix depends on the entries in a subtle way 
and the "interactions" among different rows/columns must be taken into 
account. The lower bound argument developed in this paper is aimed at 
treating "two-directional" problems by mixing over both rows and columns. 
It can be viewed as a simultaneous application of Le Cam's method in one 
direction and Assouad's lemma in another. In contrast, for sequence esti- 
mation problems, we typically need one or the other, but not both at the 
same time. The lower bound techniques developed in this paper can be used 
to solve other matrix estimation problems. For example, Cai, Liu and Zhou 
(2011) applied the general lower bound argument to the problem of estimat- 
ing sparse precision matrices under the spectral norm and established the 
optimal rate of convergence. This problem is closely connected to graphical 
model selection. The derivations of both the lower and upper bounds are 
involved. For reasons of space, we shall report the results elsewhere. 

In this paper we also developed a unified result on the minimax rate of 
convergence for estimating sparse covariance matrices under a class of Breg- 
man divergence losses which include the commonly used Frobenius norm as 
a special case. The optimal rate of convergence given in Theorem 4 is
identical to the minimax rate for estimating a row/column as a vector with the
weak $\ell_q$ ball constraint under the squared error loss. Our result shows that
this class of Bregman divergence losses are essentially the same and thus 
can be studied simultaneously in terms of the minimax rate of convergence. 

Estimating a sparse covariance matrix is intrinsically a heteroscedastic 
problem in the sense that the variances of the entries of the sample covariance 
matrix are not equal and can vary over a wide range. A natural approach is to 
adaptively threshold the entries according to their individual variabilities. 
Cai and Liu (2011) considered such an adaptive approach for estimation
over the weighted $\ell_q$ balls, which contain the strong $\ell_q$ balls as
subsets. The lower bound given in Proposition 1 in the present paper
immediately yields a lower bound for estimation over the weighted $\ell_q$
balls. A data-driven thresholding procedure was introduced and shown to
adaptively achieve the optimal rate of convergence over a large collection
of the weighted $\ell_q$ balls under the spectral norm. In contrast, universal
thresholding estimators are sub-optimal over the same parameter spaces.

In addition to the hard thresholding estimator used in Bickel and Levina 
(2008b), Rothman, Levina and Zhu (2009) considered a class of thresholding 
rules with more general thresholding functions, including soft thresholding 
and adaptive Lasso. It is straightforward to show that these thresholding
estimators with the same choice of threshold level used in (28) also attain
the optimal rate of convergence over the parameter space $\mathcal{G}_q(c_{n,p})$ under
mean squared spectral norm error as well as under the class of Bregman 
divergence losses considered in Section 5 with the same modification as in 
(34). Therefore, the choice of the thresholding function is not important as 
far as the rate optimality is concerned. 

7. Proofs. In this section we prove the general lower bound result given 
in Lemma 3, Theorems 3 and 4 as well as some of the important technical 
lemmas used in the proof of Theorem 2 given in Section 3. The proofs of a 
few technical results used in this section are deferred to the supplementary 
material [Cai and Zhou (2012)]. Throughout this section, we denote by C a 
generic constant that may vary from place to place. 

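7.1. Proof of Lemma 3. We first bound the maximum risk by the average
over the whole parameter set,

(36)  $\max_{\Theta} 2^s E_{X|\theta} d^s(T, \psi(\theta)) \ge \frac{1}{2^r D_\Lambda} \sum_{\theta} 2^s E_{X|\theta} d^s(T, \psi(\theta)) = \frac{1}{2^r D_\Lambda} \sum_{\theta} E_{X|\theta} [2 d(T, \psi(\theta))]^s.$

Set $\hat\theta = \arg\min_{\theta \in \Theta} d(T, \psi(\theta))$. Note that the
minimum is not necessarily unique. When it is not unique, pick $\hat\theta$ to
be any point in the minimum set. Then the triangle inequality for the metric
d gives

(37)  $E_{X|\theta} d^s(\psi(\hat\theta), \psi(\theta)) \le E_{X|\theta} [d(\psi(\hat\theta), T) + d(T, \psi(\theta))]^s \le E_{X|\theta} [2 d(T, \psi(\theta))]^s,$

where the last inequality is due to the fact
$d(\psi(\hat\theta), T) = d(T, \psi(\hat\theta)) \le d(T, \psi(\theta))$ from the
definition of $\hat\theta$. Equations (36) and (37) together yield

(38)  $\max_{\Theta} 2^s E_{X|\theta} d^s(T, \psi(\theta)) \ge \frac{1}{2^r D_\Lambda} \sum_{\theta} E_{X|\theta} d^s(\psi(\hat\theta), \psi(\theta)) \ge \alpha \frac{1}{2^r D_\Lambda} \sum_{\theta} E_{X|\theta} H(\gamma(\hat\theta), \gamma(\theta)),$

where the last step follows from the definition of $\alpha$ in equation (11).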
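We now show

(39)  $\frac{1}{2^r D_\Lambda} \sum_{\theta} E_{X|\theta} H(\gamma(\hat\theta), \gamma(\theta)) \ge \frac{r}{2} \min_i \|\bar{\mathbb{P}}_{i,0} \wedge \bar{\mathbb{P}}_{i,1}\|,$

which immediately implies
$\max_{\Theta} 2^s E_{X|\theta} d^s(T, \psi(\theta)) \ge \alpha \frac{r}{2} \min_i \|\bar{\mathbb{P}}_{i,0} \wedge \bar{\mathbb{P}}_{i,1}\|$,
and Lemma 3 follows. From the definition of H in equation (6) we write

    $\frac{1}{2^r D_\Lambda} \sum_{\theta} E_{X|\theta} H(\gamma(\hat\theta), \gamma(\theta)) = \frac{1}{2^r D_\Lambda} \sum_{\theta} \sum_{i=1}^{r} E_{X|\theta} |\gamma_i(\hat\theta) - \gamma_i(\theta)|.$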

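The right-hand side can be further written as

    $\sum_{i=1}^{r} \frac{1}{2^r D_\Lambda} \sum_{\gamma \in \Gamma} \sum_{\{\theta: \gamma(\theta) = \gamma\}} E_{X|\theta} |\gamma_i(\hat\theta) - \gamma_i(\theta)|$

    $= \sum_{i=1}^{r} \frac{1}{2^r D_\Lambda} \left[ \sum_{\{\gamma: \gamma_i = 0\}} \sum_{\{\theta: \gamma(\theta) = \gamma\}} \int \gamma_i(\hat\theta) \, d\mathbb{P}_\theta + \sum_{\{\gamma: \gamma_i = 1\}} \sum_{\{\theta: \gamma(\theta) = \gamma\}} \int (1 - \gamma_i(\hat\theta)) \, d\mathbb{P}_\theta \right]$

    $= \sum_{i=1}^{r} \left[ \int \gamma_i(\hat\theta) \left( \frac{1}{2^r D_\Lambda} \sum_{\{\gamma: \gamma_i = 0\}} \sum_{\{\theta: \gamma(\theta) = \gamma\}} d\mathbb{P}_\theta \right) + \int (1 - \gamma_i(\hat\theta)) \left( \frac{1}{2^r D_\Lambda} \sum_{\{\gamma: \gamma_i = 1\}} \sum_{\{\theta: \gamma(\theta) = \gamma\}} d\mathbb{P}_\theta \right) \right]$

    $= \frac{1}{2} \sum_{i=1}^{r} \left[ \int \gamma_i(\hat\theta) \, d\bar{\mathbb{P}}_{i,0} + \int (1 - \gamma_i(\hat\theta)) \, d\bar{\mathbb{P}}_{i,1} \right].$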

The following elementary result is useful to establish the lower bound for 
the minimax risk. See, for example, page 40 of Le Cam (1973). 

Lemma 7. The total variation affinity satisfies 



A( 
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It follows immediately from Lemma 7 that 



1 



1 - 

4 = 1 

>^min||P,,oAPi,i| 



and so equation (39) is established. 



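7.2. Proof of Lemma 5. Let $v = (v_i)$ be a column p-vector with $v_i = 0$
for $1 \le i \le p - r$ and $v_i = 1$ for $p - r + 1 \le i \le p$, that is,
$v = (1\{p - r + 1 \le i \le p\})_{p \times 1}$. Set
$w = (w_i) = [\Sigma(\theta) - \Sigma(\theta')] v$. Note that for each i, if
$|\gamma_i(\theta) - \gamma_i(\theta')| = 1$, we have $|w_i| = k\epsilon_{n,p}$. Then
there are at least $H(\gamma(\theta), \gamma(\theta'))$ elements $w_i$ with
$|w_i| = k\epsilon_{n,p}$, which implies

    $\|[\Sigma(\theta) - \Sigma(\theta')] v\|_2^2 \ge H(\gamma(\theta), \gamma(\theta')) \cdot (k\epsilon_{n,p})^2.$

Since $\|v\|_2^2 = r \le p$, the equation above yields

    $|||\Sigma(\theta) - \Sigma(\theta')|||^2 \ge \frac{\|[\Sigma(\theta) - \Sigma(\theta')] v\|_2^2}{\|v\|_2^2} \ge \frac{H(\gamma(\theta), \gamma(\theta')) \cdot (k\epsilon_{n,p})^2}{p},$

that is,

    $\frac{|||\Sigma(\theta) - \Sigma(\theta')|||^2}{H(\gamma(\theta), \gamma(\theta'))} \ge \frac{(k\epsilon_{n,p})^2}{p}$

when $H(\gamma(\theta), \gamma(\theta')) \ge 1$.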

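7.3. Proof of Lemma 6. The proof of the bound for the affinity given in
Lemma 6 is involved. We break the proof into a few major technical lemmas
which are proved in Section 7.3 and the supplementary material. Without
loss of generality we consider only the case i = 1 and prove that there exists
a constant $c_1 > 0$ such that
$\|\bar{\mathbb{P}}_{1,0} \wedge \bar{\mathbb{P}}_{1,1}\| \ge c_1$. The following
lemma is the key step which turns the problem of bounding the total variation
affinity into a chi-squared distance calculation on Gaussian mixtures.

Lemma 8. (i) There exists a constant $c_2 < 1$ such that

(40)  $E_{(\gamma_{-1}, \lambda_{-1})} \left\{ \int \left( \frac{d\bar{\mathbb{P}}_{(1,1,\gamma_{-1},\lambda_{-1})}}{d\bar{\mathbb{P}}_{(1,0,\gamma_{-1},\lambda_{-1})}} \right)^2 d\bar{\mathbb{P}}_{(1,0,\gamma_{-1},\lambda_{-1})} - 1 \right\} \le c_2^2.$

(ii) Moreover, equation (40) implies that
$\|\bar{\mathbb{P}}_{1,0} \wedge \bar{\mathbb{P}}_{1,1}\| \ge 1 - c_2 > 0$.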

The proof of Lemma 8(ii) is relatively easy and is given in the
supplementary material. Our goal in the remainder of this proof is to establish
(40), which requires detailed understanding of
$\bar{\mathbb{P}}_{(1,0,\gamma_{-1},\lambda_{-1})}$ and the mixture distribution
$\bar{\mathbb{P}}_{(1,1,\gamma_{-1},\lambda_{-1})}$ as well as a careful analysis
of the cross-product terms in the chi-squared distances on the left-hand side
of (40).

From the definition of $\theta$ in equation (12) and
$\bar{\mathbb{P}}_{(1,0,\gamma_{-1},\lambda_{-1})}$ in equation (13), $\gamma_1 = 0$
implies that $\bar{\mathbb{P}}_{(1,0,\gamma_{-1},\lambda_{-1})}$ is a single
multivariate normal distribution with covariance matrix

(41)  $\Sigma_0 = \begin{pmatrix} 1 & 0_{1 \times (p-1)} \\ 0_{(p-1) \times 1} & S_{(p-1) \times (p-1)} \end{pmatrix}.$

Here $S_{(p-1) \times (p-1)} = (s_{ij})_{2 \le i,j \le p}$ is a symmetric matrix
uniquely determined by
$(\gamma_{-1}, \lambda_{-1}) = ((\gamma_2, \dots, \gamma_r), (\lambda_2, \dots, \lambda_r))$,
where for $i \le j$,

    $s_{ij} = \begin{cases} 1, & i = j, \\ \epsilon_{n,p}, & \gamma_i = \lambda_i(j) = 1, \\ 0, & \text{otherwise.} \end{cases}$
Let

    $\Lambda_1(c) = \{a \in B : \exists\, \theta \in \Theta \text{ such that } \lambda_1(\theta) = a \text{ and } \lambda_{-1}(\theta) = c\},$

which gives the set of all possible values of the first row with the rest of
the rows fixed, that is, $\lambda_{-1}(\theta) = c$. Let $n_{\lambda_{-1}}$ be the
number of columns of $\lambda_{-1}$ with the column sum equal to 2k, for which
the first row has no choice but to take the value 0 in this column. Set
$p_{\lambda_{-1}} = r - n_{\lambda_{-1}}$. It is helpful to observe that
$p_{\lambda_{-1}} \ge p/4 - 1$. Since $n_{\lambda_{-1}} \cdot 2k \le r \cdot k$, the
total number of 1s in the upper triangular matrix by the construction of the
parameter set, we thus have $n_{\lambda_{-1}} \le r/2$, which immediately implies
$p_{\lambda_{-1}} = r - n_{\lambda_{-1}} \ge r/2 \ge p/4 - 1$. It follows that
$\mathrm{Card}(\Lambda_1(\lambda_{-1})) = \binom{p_{\lambda_{-1}}}{k}$. Then, from
the definitions in equations (12) and (13),
$\bar{\mathbb{P}}_{(1,1,\gamma_{-1},\lambda_{-1})}$ is an average of
$\binom{p_{\lambda_{-1}}}{k}$ multivariate normal distributions with covariance
matrices of the following form:

(42)  $\begin{pmatrix} 1 & r_{1 \times (p-1)} \\ r_{(p-1) \times 1} & S_{(p-1) \times (p-1)} \end{pmatrix},$

where $\|r\|_0 = k$ with nonzero elements of r equal to $\epsilon_{n,p}$, and the
submatrix $S_{(p-1) \times (p-1)}$ is the same as the one for $\Sigma_0$ given in
(41).

Recall that for each $\theta \in \Theta$, $\mathbb{P}_\theta$ is the joint
distribution of the n i.i.d. multivariate normal variables $X_1, \dots, X_n$.
So each term in the chi-squared distance on the left-hand side of (40) is of
the form $\left( \int \frac{g_1 g_2}{g_0} \right)^n$, where $g_i$ is the density
function of $N(0, \Sigma_i)$ for i = 0, 1 and 2, with $\Sigma_0$ defined in (41)
and $\Sigma_1$ and $\Sigma_2$ of the form (42).

The following lemma is useful for calculating the cross product terms
in the chi-squared distance between Gaussian mixtures. The proof of the
lemma is straightforward and is thus omitted.

Lemma 9. Let $g_i$ be the density function of $N(0, \Sigma_i)$ for i = 0, 1
and 2, respectively. Then

    $\int \frac{g_1 g_2}{g_0} = [\det(I - \Sigma_0^{-2} (\Sigma_1 - \Sigma_0)(\Sigma_2 - \Sigma_0))]^{-1/2}.$

Let $\Sigma_0$ be defined in (41) and determined by $(\gamma_{-1}, \lambda_{-1})$.
Let $\Sigma_1$ and $\Sigma_2$ be of the form (42) with the first row $\lambda_1$
and $\lambda_1'$, respectively. Set

(43)  $R^{\gamma_{-1}, \lambda_{-1}}_{\lambda_1, \lambda_1'} = -\log\det(I - \Sigma_0^{-2} (\Sigma_0 - \Sigma_1)(\Sigma_0 - \Sigma_2)).$

We sometimes drop the indices $(\lambda_1, \lambda_1')$ and
$(\gamma_{-1}, \lambda_{-1})$ from $\Sigma_i$ to simplify the notation whenever
there is no ambiguity. Then each term in the chi-squared distance on the
left-hand side of (40) can be expressed in the form

    $\exp\left( \frac{n}{2} \cdot R^{\gamma_{-1}, \lambda_{-1}}_{\lambda_1, \lambda_1'} \right) - 1.$

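Define

    $\Theta_{-1}(a_1, a_2) = \{0, 1\}^{r-1} \otimes \{c \in \Lambda_{-1} : \exists\, \theta_i \in \Theta,\ i = 1, 2, \text{ such that } \lambda_1(\theta_i) = a_i,\ \lambda_{-1}(\theta_i) = c\}.$

It is a subset of $\Theta_{-1}$ in which the element can pick both $a_1$ and
$a_2$ as the first row to form parameters in $\Theta$. From Lemma 9 the average
of the chi-squared distance on the left-hand side of equation (40) can now be
written as

(44)  $E_{(\gamma_{-1}, \lambda_{-1})} \left\{ E_{(\lambda_1, \lambda_1') | \lambda_{-1}} \left[ \exp\left( \frac{n}{2} \cdot R^{\gamma_{-1}, \lambda_{-1}}_{\lambda_1, \lambda_1'} \right) - 1 \right] \right\} = E_{(\lambda_1, \lambda_1')} \left\{ E_{(\gamma_{-1}, \lambda_{-1}) | (\lambda_1, \lambda_1')} \left[ \exp\left( \frac{n}{2} \cdot R^{\gamma_{-1}, \lambda_{-1}}_{\lambda_1, \lambda_1'} \right) - 1 \right] \right\},$

where $\lambda_1$ and $\lambda_1'$ are independent and uniformly distributed
over $\Lambda_1(\lambda_{-1})$ (not over B) for given $\lambda_{-1}$, and the
distribution of $(\gamma_{-1}, \lambda_{-1})$ given $(\lambda_1, \lambda_1')$ is
uniform over $\Theta_{-1}(\lambda_1, \lambda_1')$, but the marginal distribution
of $\lambda_1$ and $\lambda_1'$ are not independent and uniformly distributed
over B.

Let $\Sigma_1$ and $\Sigma_2$ be two covariance matrices of the form (42). Note
that $\Sigma_1$ and $\Sigma_2$ differ from each other only in the first
row/column. Then $\Sigma_i - \Sigma_0$, i = 1 or 2, has a very simple structure.
The nonzero elements only appear in the first row/column, and in total there
are at most 2k nonzero elements. This property immediately implies the
following lemma, which makes the problem of studying the determinant in
Lemma 9 relatively easy. The proof of Lemma 10 below is given in the
supplementary material.

Lemma 10. Let $\Sigma_0$ be defined in (41) and let $\Sigma_1$ and $\Sigma_2$
be two covariance matrices of the form (42). Define J to be the number of
overlapping $\epsilon_{n,p}$'s between $\Sigma_1$ and $\Sigma_2$ on the first
row, and

    $Q = (q_{ij})_{1 \le i,j \le p} = (\Sigma_1 - \Sigma_0)(\Sigma_2 - \Sigma_0).$

There are index subsets $I_r$ and $I_c$ in $\{2, \dots, p\}$ with
$\mathrm{Card}(I_r) = \mathrm{Card}(I_c) = k$ and $\mathrm{Card}(I_r \cap I_c) = J$
such that

    $q_{ij} = \begin{cases} J\epsilon_{n,p}^2, & i = j = 1, \\ \epsilon_{n,p}^2, & i \in I_r \text{ and } j \in I_c, \\ 0, & \text{otherwise,} \end{cases}$

and the matrix $(\Sigma_0 - \Sigma_1)(\Sigma_0 - \Sigma_2)$ has rank 2 with two
identical nonzero eigenvalues $J\epsilon_{n,p}^2$ when J > 0.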

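The matrix Q is determined by two interesting parts, the first element
$q_{11} = J\epsilon_{n,p}^2$ and a very special k x k square matrix
$(q_{ij} : i \in I_r \text{ and } j \in I_c)$ with all elements equal to
$\epsilon_{n,p}^2$. The following result, which is proved in the supplementary
material, shows that $R^{\gamma_{-1}, \lambda_{-1}}_{\lambda_1, \lambda_1'}$ is
approximately equal to

    $-\log\det(I - (\Sigma_0 - \Sigma_1)(\Sigma_0 - \Sigma_2)) = -2\log(1 - J\epsilon_{n,p}^2),$

where J is defined in Lemma 10. Define

    $\Lambda_{1,J} = \{(\lambda_1, \lambda_1') \in B \otimes B : \text{the number of overlapping } \epsilon_{n,p}\text{'s between } \lambda_1 \text{ and } \lambda_1' \text{ is } J\}.$

Lemma 11. Let $R^{\gamma_{-1}, \lambda_{-1}}_{\lambda_1, \lambda_1'}$ be defined in
equation (43). Then

(45)  $R^{\gamma_{-1}, \lambda_{-1}}_{\lambda_1, \lambda_1'} = -2\log(1 - J\epsilon_{n,p}^2) + R^{\gamma_{-1}, \lambda_{-1}}_{1, \lambda_1, \lambda_1'},$

where $R^{\gamma_{-1}, \lambda_{-1}}_{1, \lambda_1, \lambda_1'}$ satisfies, uniformly
over all J,

(46)  $E_{(\lambda_1, \lambda_1') | J} \left[ E_{(\gamma_{-1}, \lambda_{-1}) | (\lambda_1, \lambda_1')} \exp\left( \frac{n}{2} R^{\gamma_{-1}, \lambda_{-1}}_{1, \lambda_1, \lambda_1'} \right) \right] \le \frac{3}{2}.$

With the preparations given above, we are now ready to establish equation
(40) and thus complete the proof of Lemma 6.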

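Proof of equation (40). Equation (45) in Lemma 11 yields that

    $E_{(\lambda_1, \lambda_1')} \left\{ E_{(\gamma_{-1}, \lambda_{-1}) | (\lambda_1, \lambda_1')} \left[ \exp\left( \frac{n}{2} R^{\gamma_{-1}, \lambda_{-1}}_{\lambda_1, \lambda_1'} \right) - 1 \right] \right\}$

    $= E_J \left\{ \exp[-n\log(1 - J\epsilon_{n,p}^2)] \cdot E_{(\lambda_1, \lambda_1') | J} \left[ E_{(\gamma_{-1}, \lambda_{-1}) | (\lambda_1, \lambda_1')} \exp\left( \frac{n}{2} R^{\gamma_{-1}, \lambda_{-1}}_{1, \lambda_1, \lambda_1'} \right) \right] - 1 \right\}.$

Recall that J is the number of overlapping $\epsilon_{n,p}$'s between
$\Sigma_1$ and $\Sigma_2$ on the first row. It is easy to see that J has the
hypergeometric distribution as $\lambda_1$ and $\lambda_1'$ vary in B for each
given $\lambda_{-1}$. For $0 \le j \le k$,

(47)  $E_J(1\{J = j\} \mid \lambda_{-1}) = \binom{k}{j} \binom{p_{\lambda_{-1}} - k}{k - j} \Big/ \binom{p_{\lambda_{-1}}}{k} = \frac{[k!/(k-j)!]^2}{\{p_{\lambda_{-1}}! \, (p_{\lambda_{-1}} - 2k + j)!\} / [(p_{\lambda_{-1}} - k)!]^2} \cdot \frac{1}{j!} \le \left( \frac{k^2}{p_{\lambda_{-1}} - k} \right)^j,$

where $k!/(k-j)!$ is a product of j terms with each term $\le k$, and
$\{p_{\lambda_{-1}}! \, (p_{\lambda_{-1}} - 2k + j)!\} / [(p_{\lambda_{-1}} - k)!]^2$
is bounded below by a product of j terms with each term
$\ge p_{\lambda_{-1}} - j$. Since $p_{\lambda_{-1}} \ge p/4 - 1$ for all
$\lambda_{-1}$, we have

    $E(1\{J = j\}) = E_{\lambda_{-1}} [E_J(1\{J = j\} \mid \lambda_{-1})] \le \left( \frac{k^2}{p/4 - 1 - k} \right)^j.$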

Thus 

^^(1, 1,7-1, A_i 



fc2 



p/4-l-k 



E(7-i,A-.) " ^IP{1,0,.-„A.,) - 1 



(l,0,7-i,A_i) 



(48) < E(^7i^^)'{-p[-i°g(i -^-^p)] • r 4 

= ^E( p/4-1-A: ) ^M2j{v'logp)] 

+ (^yj^)°{exp[-nlog(l - . 4,,)] • ^ - 1} 

i>i j>i 
by setting = 3/4, where the last step follows from < 4^ and k'^ 



Oii^) = Oi^) as defined in Section 3. 
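The hypergeometric bound in (47) can be sanity-checked numerically. The sketch below assumes the simplified setting in which two k-subsets are drawn independently and uniformly from m candidate columns, with m playing the role of $p_{\lambda_{-1}}$; the function name and the values k = 5, m = 200 are illustrative.

```python
from math import comb

def overlap_pmf(j, k, m):
    """P(J = j) for the overlap J of two independent uniform k-subsets of m items."""
    return comb(k, j) * comb(m - k, k - j) / comb(m, k)

k, m = 5, 200
pmf = [overlap_pmf(j, k, m) for j in range(k + 1)]
bound = [(k * k / (m - k)) ** j for j in range(k + 1)]

for j in range(k + 1):
    print(j, pmf[j], bound[j])
```

The geometric decay of the bound in j is what allows the sum over j in (48) to be controlled once $k^2$ is small relative to p.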



Remark 5. The condition $p \ge n^\beta$ for some $\beta > 1$ is assumed so that

    $\frac{k^2}{p_{\lambda_{-1}} - k} \le \frac{k^2}{p/4 - 1 - k} = \frac{O(n/\log p)}{p/4 - 1 - k} = o(p^{-\epsilon})$

for some $\epsilon > 0$, which makes the term (48) o(1).

7.4. Proof of Theorem 3. The following lemma, which is proved in Cai
and Zhou (2009), is now useful to prove Theorem 3.

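Lemma 12. Define the event $A_{ij}$ by

(49)  $A_{ij} = \left\{ |\hat\sigma_{ij} - \sigma_{ij}| \le 4 \min\left\{ |\sigma_{ij}|, \gamma \sqrt{\frac{\log p}{n}} \right\} \right\}.$

Then

    $P(A_{ij}) \ge 1 - 2C_1 p^{-9/2}.$

Let $D = (d_{ij})_{1 \le i,j \le p}$ with
$d_{ij} = (\hat\sigma_{ij} - \sigma_{ij}) I(A_{ij}^c)$. Then

(50)  $E_{X|\theta} |||\hat\Sigma - \Sigma|||^2 \le 2 E_{X|\theta} \left[ \sup_j \sum_{i \ne j} |\hat\sigma_{ij} - \sigma_{ij}| I(A_{ij}) \right]^2 + 2 E_{X|\theta} |||D|||_1^2 + C \frac{\log p}{n}$

      $\le 32 \left[ \sup_j \sum_{i \ne j} \min\left\{ |\sigma_{ij}|, \gamma \sqrt{\frac{\log p}{n}} \right\} \right]^2 + 2 E_{X|\theta} |||D|||_1^2 + C \frac{\log p}{n}.$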

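We will see that the first term in equation (50) is dominating and is
bounded by $C c_{n,p}^2 (\log p / n)^{1-q}$, while the second term
$E_{X|\theta} |||D|||_1^2$ is negligible. Set
$k^* = \lfloor c_{n,p} (n / \log p)^{q/2} \rfloor$. Then, writing
$|\sigma_{[i]j}|$ for the i-th largest entry in magnitude of the j-th column,
we have

(51)  $\sum_{i \ne j} \min\left\{ |\sigma_{ij}|, \gamma \sqrt{\frac{\log p}{n}} \right\} \le \left( \sum_{i \le k^*} + \sum_{i > k^*} \right) \min\left\{ |\sigma_{[i]j}|, \gamma \sqrt{\frac{\log p}{n}} \right\}$

      $\le C_5 k^* \sqrt{\frac{\log p}{n}} + C_5 \sum_{i > k^*} \left( \frac{c_{n,p}}{i} \right)^{1/q} \le C_6 c_{n,p} \left( \frac{\log p}{n} \right)^{(1-q)/2},$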

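which immediately implies equation (29) if
$E_{X|\theta} |||D|||_1^2 = O(1/n)$. We shall now show that
$E_{X|\theta} |||D|||_1^2 = O(1/n)$. Note that

    $E_{X|\theta} |||D|||_1^2 \le p \sum_{i,j} E_{X|\theta} d_{ij}^2 = p \sum_{i,j} E_{X|\theta} \left\{ d_{ij}^2 I(A_{ij}^c \cap \{\hat\sigma_{ij} = \sigma^*_{ij}\}) + d_{ij}^2 I(A_{ij}^c \cap \{\hat\sigma_{ij} = 0\}) \right\} = R_1 + R_2.$

Lemma 12 yields that $P(A_{ij}^c) \le 2C_1 p^{-9/2}$, and the Whittle inequality
implies that $\sigma^*_{ij} - \sigma_{ij}$ has all finite moments [cf. Whittle
(1960)] under the subgaussianity condition (2). Hence

    $R_1 \le C_3 \, p \cdot p^2 \cdot \frac{1}{n} \cdot p^{-3} = C_3 / n.$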

On the other hand, 

R2 =p5^Exi94/(A^,- n {<Ti,- = 0}) 



h| <7 



ij 
ij 



logp 



n 



E 



nafj ■ Ci exp 



• exp 



logp 



n 



<Cs^.p^.p-^^<Cg/n. 
n 

Putting $R_1$ and $R_2$ together yields that for some constant C > 0,

(52)  $E_{X|\theta} |||D|||_1^2 \le \frac{C}{n}.$

Theorem 3 is proved by combining equations (50), (51) and (52).

7.5. Proof of Theorem 4. We establish separately the lower and upper
bounds under the Bregman divergence losses. The following lemma relates
a general Bregman divergence to the squared Frobenius norm.

Lemma 13. Assume that all eigenvalues of two symmetric matrices X
and Y belong to $[\epsilon_2, M_2]$. Then there exist constants
$c_2 > c_1 > 0$ depending only on $\epsilon_2$ and $M_2$ such that for all
$\phi$ defined in (32),

    $c_1 |||X - Y|||_F^2 \le D_\phi(X, Y) \le c_2 |||X - Y|||_F^2.$

Proof. Let the eigen decompositions of X and Y be

    $X = \sum_{i=1}^{p} \lambda_i v_i v_i^T$ and $Y = \sum_{i=1}^{p} \gamma_i u_i u_i^T.$

For every $\phi(X) = \sum_{i=1}^{p} \varphi(\lambda_i)$ it is easy to see that

(53)  $D_\phi(X, Y) = \sum_{i,j} (v_i^T u_j)^2 [\varphi(\lambda_i) - \varphi(\gamma_j) - \varphi'(\gamma_j) \cdot (\lambda_i - \gamma_j)].$

See Kulis, Sustik and Dhillon (2009), Lemma 1. The Taylor expansion gives

    $D_\phi(X, Y) = \sum_{i,j} (v_i^T u_j)^2 \, \frac{1}{2} \varphi''(\xi_{ij}) (\lambda_i - \gamma_j)^2,$

where $\xi_{ij}$ is in between $\lambda_i$ and $\gamma_j$ and is therefore
contained in $[\epsilon_2, M_2]$. From the assumption in (32), there are
constants $c_L$ and $c_u$ such that $c_L \le \varphi''(\lambda) \le c_u$ for all
$\lambda$ in $[\epsilon_2, M_2]$, which immediately implies

    $\frac{1}{2} c_L \sum_{i,j} (v_i^T u_j)^2 (\lambda_i - \gamma_j)^2 \le D_\phi(X, Y) \le \frac{1}{2} c_u \sum_{i,j} (v_i^T u_j)^2 (\lambda_i - \gamma_j)^2,$

where $\sum_{i,j} (v_i^T u_j)^2 (\lambda_i - \gamma_j)^2 = |||X - Y|||_F^2$, or
equivalently

    $\frac{1}{2} c_L |||X - Y|||_F^2 \le D_\phi(X, Y) \le \frac{1}{2} c_u |||X - Y|||_F^2.$
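Identity (53) and the two-sided bound of Lemma 13 can be checked numerically for Stein's loss, for which $\varphi(\lambda) = -\log\lambda$ and $\varphi''(\lambda) = 1/\lambda^2$. In the sketch below the test matrices, the variable names, and the choice of $\epsilon_2$ and $M_2$ as the extreme eigenvalues of the two matrices are assumptions made for illustration.

```python
import numpy as np

def random_spd(p, rng):
    """Hypothetical test matrix: a well-conditioned positive definite matrix."""
    A = rng.standard_normal((p, p))
    return A @ A.T / p + np.eye(p)

rng = np.random.default_rng(1)
p = 6
X, Y = random_spd(p, rng), random_spd(p, rng)

lam, V = np.linalg.eigh(X)               # eigenvalues lambda_i, eigenvectors v_i
gam, U = np.linalg.eigh(Y)               # eigenvalues gamma_j, eigenvectors u_j
W = (V.T @ U) ** 2                       # W[i, j] = (v_i^T u_j)^2
L, G = lam[:, None], gam[None, :]

# Right-hand side of (53) with varphi(t) = -log t and varphi'(t) = -1/t.
D_eig = np.sum(W * (-np.log(L) + np.log(G) + (L - G) / G))

# Stein's loss computed directly from its closed form.
M = X @ np.linalg.inv(Y)
D_direct = np.trace(M) - np.linalg.slogdet(M)[1] - p

# Bounds of Lemma 13 with c_L = 1 / M_2^2 and c_u = 1 / eps_2^2.
eps2 = min(lam.min(), gam.min())
M2 = max(lam.max(), gam.max())
frob2 = np.sum((X - Y) ** 2)
lower, upper = 0.5 * frob2 / M2**2, 0.5 * frob2 / eps2**2

print(D_eig, D_direct, lower, upper)
```

The weights `W` sum to one along each row and column, which is why the eigenvalue form collapses to the squared Frobenius norm when $\varphi$ is quadratic.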
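Lower bound under Bregman matrix divergences. It is trivial to see that

    $\inf_{\hat\Sigma} \sup_{\theta \in \mathcal{P}^B_q(\tau, c_{n,p})} E_{X|\theta} L_\phi(\hat\Sigma, \Sigma(\theta)) \ge c \frac{1}{n}$

by constructing a parameter space with only diagonal matrices. It is then
enough to show that there exists some constant c > 0 such that

    $\inf_{\hat\Sigma} \max_{\theta \in \mathcal{F}_*} E_{X|\theta} L_\phi(\hat\Sigma, \Sigma(\theta)) \ge c \, c_{n,p} \left( \frac{\log p}{n} \right)^{1-q/2}$

for all $\phi \in \Phi$ defined in (32). Equation (53) implies

(54)  $\inf_{\hat\Sigma} \max_{\theta \in \mathcal{F}_*} E_{X|\theta} L_\phi(\hat\Sigma, \Sigma(\theta)) = \inf_{\hat\Sigma : \epsilon_1 I \preceq \hat\Sigma \preceq 2\tau I} \max_{\theta \in \mathcal{F}_*} E_{X|\theta} L_\phi(\hat\Sigma, \Sigma(\theta)).$

Convexity of $\varphi$ implies that
$\varphi(\lambda_i) - \varphi(\gamma_j) - \varphi'(\gamma_j) \cdot (\lambda_i - \gamma_j)$
is nonnegative and increasing when $\lambda_i$ moves away from the range
$[\epsilon_1, 2\tau]$ of the eigenvalues $\gamma_j$ of $\Sigma(\theta)$. From
Lemma 13 there is a universal constant $c_L$ such that

    $\inf_{\hat\Sigma} \max_{\theta \in \mathcal{F}_*} E_{X|\theta} L_\phi(\hat\Sigma, \Sigma(\theta)) \ge c_L \inf_{\hat\Sigma : \epsilon_1 I \preceq \hat\Sigma \preceq 2\tau I} \max_{\theta \in \mathcal{F}_*} E_{X|\theta} \frac{1}{p} |||\hat\Sigma - \Sigma(\theta)|||_F^2$

    $= \frac{c_L}{p} \inf_{\hat\Sigma} \max_{\theta \in \mathcal{F}_*} E_{X|\theta} |||\hat\Sigma - \Sigma(\theta)|||_F^2,$

where the last equality is from the same argument for equation (54).

It then suffices to study the lower bound under the Frobenius norm.
Similar to the lower bound under the spectral norm one has

    $\inf_{\hat\Sigma} \max_{\theta \in \mathcal{F}_*} 2^2 E_{X|\theta} |||\hat\Sigma - \Sigma(\theta)|||_F^2 \ge \min_{\{(\theta, \theta') : H(\gamma(\theta), \gamma(\theta')) \ge 1\}} \frac{|||\Sigma(\theta) - \Sigma(\theta')|||_F^2}{H(\gamma(\theta), \gamma(\theta'))} \cdot \frac{r}{2} \min_i \|\bar{\mathbb{P}}_{i,0} \wedge \bar{\mathbb{P}}_{i,1}\|.$

It is easy to see

    $\min_{\{(\theta, \theta') : H(\gamma(\theta), \gamma(\theta')) \ge 1\}} \frac{|||\Sigma(\theta) - \Sigma(\theta')|||_F^2}{H(\gamma(\theta), \gamma(\theta'))} \asymp c_{n,p} \left( \frac{\log p}{n} \right)^{1-q/2},$

and it follows from Lemma 6 that there is a constant c > 0 such that

    $\min_i \|\bar{\mathbb{P}}_{i,0} \wedge \bar{\mathbb{P}}_{i,1}\| \ge c.$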

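Upper bound under Bregman matrix divergences. We now show that there
exists an estimator $\hat\Sigma$ such that

(55)  $E_{X|\theta} L_\phi(\hat\Sigma, \Sigma) \le c \left[ c_{n,p} \left( \frac{\log p}{n} \right)^{1-q/2} + \frac{1}{n} \right]$

for some constant c > 0, uniformly over all $\phi \in \Phi$ and all
distributions in $\mathcal{P}^B_q(\tau, c_{n,p})$. Let
$A_0 = \bigcap_{i,j} A_{ij}$, where $A_{ij}$ is defined in (49). Lemma 12 yields
that

(56)  $P(A_0) \ge 1 - 2C_1 p^{-5/2}.$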



Lemma 14. Let Tjb be defined in equation (34)- Then for aWE {p,c. 



n,p J 
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Proof. Write = S + (S^ - S). Since - < - S|||i, the 
lemma is then a direct consequence of Lemma 12 and equation (51) which 
implies \\\±B - < Cc„,p(i^)(i-«)/2 ^ q over Aq. □ 

Lemma 13 implies 
IEx|6iL0(Sb, S) 

(57) 

< CExiej^lllS - S||||/(Ao)| +Ex|aL4^B, S)I(Ag)} 

<16(:7supJ^minj|cj,,f ,7^|+IEx|a^(SB,S)/(^g)} + C-. 

The second term in (57) is negligible since 

Ex|,L^(SB,S){Ag} < C - [max{logn,logp}]l^-| •P(A^) 
< C ■ [max{logn,logp}]l''IC7ip-5/4 



o c, 



1 \ l-<?/2 
logp ^ ' 



n 



by applying the Cauchy-Schwarz inequality twice. We now consider the first 
term in equation (57). Set k* = |_c„^p([^^)'^/^J . Then we have 



^ mini \a^J 7^ + ™^"| ^ | 

ijtj ^ ^ \<k* i>k*'' ^ ^ 



1 / \ 2/g 



n 

't>fc* 



<C74 



*i^ + 4/.r.(r)-2/«' 

n 



which immediately yields equation (55) 



1 \ l-<j/2 



n 



SUPPLEMENTARY MATERIAL 



Supplement to "Optimal rates of convergence for sparse covariance matrix
estimation" (DOI: 10.1214/12-AOS998SUPP; .pdf). In this supplement
we prove the additional technical lemmas used in the proof of Lemma 6.